Daily Data Cleaning Notes: Awk FTW 4

Posted
June 18th 2019

I was thinking of naming this 3.5 or 3.1 or 3a, since it follows on the last post. But part of the beauty of a readable, not just write-happy scripting language, is that one can pick up code after a break and make sense of it easily. So let’s pretend we are not reading this with great continuity from the last post.

We found test.awk looking like this:

BEGIN { 
  FS = "|"
  while (getline < "all.geo.list" > 0) {
    split($3, geo, /,/) 
    mycount[round(geo[1]) "," round(geo[2])]++
  }
  for (i in mycount) print mycount[i],i | "sort -nr | less"
}
func round(x) { return sgn(x) * int(.5 + abs(x)) } 
func abs(x) { if (x < 0) return -x; return x }
func sgn(x) { if (x < 0) return -1; return 1 

And it’s easy to see, in this rather self-documenting form, that it reads all.geo.list, counts something and sorts by those counts. Well wasn’t this a “throwaway” script? Sure, but there are degrees of throwaway. I am quite happy not to have to rummage around the ls -tl | less or history | less to try to remember what the file was called, especially since the code depends crucially on the format of that rather temporary file.

The first thing I might do if we’re keeping this script is change sgn to sgn1, to indicate to anyone following that this might not be the normal 0, 1, -1 sgn function. It is de rigeur to search for all occurrences of that identifier, sgn, since we aren’t in a modern code development environment that does it automatically for our language. I at least look at them manually (grep if there are too many), if not replace them all manually, since substring matches are always a bummer of a surprise.

Then since I want to know what all.geo.list lines look like, I’ll add a short tail of the data as comments near the while statement. I find this mandatory when working with data, especially cleaning data. Positional references really tend to benefit from such comments, and I learned this way back in my LISP days.

BEGIN { 
  FS = "|"
  while (getline < "all.geo.list" > 0) {
# 7723|Larchmont Chronicle|34.0730556,-118.3238889|{"34.0730556,-118.3238889"}
 # 7724|CityWatch Los Angeles|34.0522342,-118.2436849|{"34.0522342,-118.2436849"}
 # 7789|GoWEHO|34.0900091,-118.3617443|{"34.0900091,-118.3617443"}
    split($3, geo, /,/) 
    mycount[round(geo[1]) "," round(geo[2])]++
  }
  for (i in mycount) print mycount[i],i | "sort -nr | less"
}
func round(x) { return sgn1(x) * int(.5 + abs(x)) } 
func abs(x) { if (x < 0) return -x; return x }
func sgn1(x) { if (x < 0) return -1; return 1 

Yes, showing sample data clutters the code and decreases the elegant-readability, but it really helps the reader-comprehension, yes?

What I really want is to append the titles when I see them to a list associated with the rounded-geo counts. This way I can inspect which publications are located at these places — that’s actually what I’m trying to do, not just pump code or wax pedantic over the production of code.

Building this structure is especially easy in a scripting language like awk because strings and associative arrays are so compliant. We could wish for more data primitives, like a stack, but hey, strings are perfectly good here. Also what my personal history of this kind of stuff has shown to me is that the less syntax, and the fewer library references, the better.

BEGIN { 
  FS = "|"
  while (getline < "all.geo.list" > 0) {
# 7723|Larchmont Chronicle|34.0730556,-118.3238889|{"34.0730556,-118.3238889"}
 # 7724|CityWatch Los Angeles|34.0522342,-118.2436849|{"34.0522342,-118.2436849"}
 # 7789|GoWEHO|34.0900091,-118.3617443|{"34.0900091,-118.3617443"}
    thisid = $1; thistitle = $2; thisgeo = $3 
    if (!thisgeo) continue 
    split(thisgeo, geo, /,/) 
    thisgindex = round(geo[1]) "," round(geo[2]) 
    if (thisgindex == "0,0") continue 
    mysto[thisgindex] = mysto[thisgindex] "," thistitle 
    mycount[thisgindex]++
  }
  for (i in mycount) print mycount[i],i,substr(mysto[i],2) | "sort -nr | less"
}
func round(x) { return sgn1(x) * int(.5 + abs(x)) } 
func abs(x) { if (x < 0) return -x; return x }
func sgn1(x) { if (x < 0) return -1; return 1 

Here, I have put all of the delimited-line parse commands on one line, I’ve renamed and introduced some variables (mainly because I wanted to use foo = foo "," bar as an append-to-delimited-list for output (note the substr(foo,2) on output to remove the first comma — this is pretty common in the strings-as-lists vernacular); sorry that this language doesn’t have a concatenate-at-end operator as far as I know. I’ve also nixed processing (continue) on two conditions, thisgeo testing as 0, or the rounded results testing as 0,0.

Now, technically I don’t need the first test, for an empty field coming out of the psql data. Because splitting empty, adding 0 to each part, and getting 0,0 should suffice to trigger the latter continue statement. In the very, very old days, one would test for that ahead of the assignments to variables needed later (thisd, thistitle) to avoid wasted work. But c’mon. It’s clearer for the next person, that’s for sure, because the downstream test for 0,0 requires some tracing of all the obvious, and non-obvious, possibilities that reach that point.

However, I must concede that the test if (!thisgeo) is purely syntactic simplicity for the beginner. It looks nice, but if for some reason thisgeo==0, or in some older awk implementations, 0,0, or even 0,-100.123, then the thisgeo==0 test could be satisfied. I know this was TOO MUCH implicit “help” back when it was thought a good idea. This is why the newer implementations don’t do that. Implicit type conversion sure, but this turns out to be a good place to go wrong.

Earlier, we might have let it go because we were throwing away the script and we were confident we were working with strings, not strings that might look like integers sometimes. But now that the code is progressing, this is a very good time to tighten things up. A truly disciplined programmer should probably test if (thisgeo != "") all the time, just as while (getline < fname > 0) should include the >0. One can get away without the discipline, when you know, speed of script writing, or brevity, matters. It may even document the data-type expectation in a weirdly non-explicit way. But it’s true that this one needs to have a good habit.

A practical guru strikes a fine balance between simplicity and complexity, while working with expediency vs. discipline tradeoffs.

So this is our full script now:

BEGIN { 
  FS = "|"
  while (getline < "all.geo.list" > 0) {
# 7723|Larchmont Chronicle|34.0730556,-118.3238889|{"34.0730556,-118.3238889"}
 # 7724|CityWatch Los Angeles|34.0522342,-118.2436849|{"34.0522342,-118.2436849"}
 # 7789|GoWEHO|34.0900091,-118.3617443|{"34.0900091,-118.3617443"}
    thisid = $1; thistitle = $2; thisgeo = $3 
    if (thisgeo != "") continue 
    split(thisgeo, geo, /,/) 
    thisgindex = round(geo[1]) "," round(geo[2]) 
    if (thisgindex == "0,0") continue 
    mysto[thisgindex] = mysto[thisgindex] "," thistitle 
    mycount[thisgindex]++
  }
  for (i in mycount) print mycount[i],i,substr(mysto[i],2) | "sort -nr | less"
}
func round(x) { return sgn1(x) * int(.5 + abs(x)) } 
func abs(x) { if (x < 0) return -x; return x }
func sgn1(x) { if (x < 0) return -1; return 1 

And of course the output looks something like this:

25 38,-122 ktvu,fox40,kron4,Oakland City Council Agenda,Oakland City Council Minutes,sacramento.cbslocal,San Francisco BOS and Committees Minutes,San Francisco BOS and Committees Agendas,abc10,The San Francisco Examiner,Bay Area Reporter,SF Bay,Argus,Ark,Berkeley Daily Planet,Daily Review,East Bay Express,Independent,InsideBayArea,Napa Valley Register,San Mateo County Times,Sonoma Index Tribune,Tri Valley Herald,Public Policy Institutre of California,Scott Wiener
19 42,-71 Boston WHDH 7News,wcvb,CBS Boston,whdh,Austin Inno,Rhode Island Inno,wpri,abc6,Christian Science Monitor,Dorchester Reporter,Foxboro Reporter,MetroWest Daily News,Patriot Ledger,Sun Chronicle,Walpole Times,East Bay Newspapers,Pawtucket Times,Providence Journal Bulletin,Warwick Beacon
16 42,-73 wfsb,wwlp,Bristol Press,Hartford Courant,Journal Inquirer,Litchfield County Times,MyRecordJournal com,New Herald,Newington Town Crier,The Middletown Press,The Register Citizen,Waterbury Republican American,West Hartford News,Berkshire Eagle,Daily Hampshire Gazette,Union News
16 39,-77 Georgetown Voice,baltimore.cbslocal,fox5dc,wjla,DC Inno,Fox5DC,Baltimore Chronicle,Capital Gazette,Dundalk Eagle,Frederick News Post,New Bay Times,Washingtonian,Times Community Newspapers,Daily Record,Prince George s Sentinel,Times Community Newspapers

which is just what I wanted to see.

Inspect your data in order to transform it. Use a powerful data-inspecting tool, a string-oriented scripting language, to have agile and novel, custom, fast, practical, and insightful inspection of data. And remember that the task is more important than the programming, so keep in mind what you’re trying to do, keep your discipline for correctness, and keep vigilant of building unnecessary abstractions that might just make things harder for the next person, or worse, that are meaningless misuse of precious programmer time. Especially if there is no next person. Especially if I’m paying for the programming time.

So many castles built in sand when people take middleware and gui programming practices into data cleaning scripting.