Daily Data Cleaning Notes: Awk FTW 3

Posted
June 17th 2019

Today’s task is simple — just a little math on geo coordinates to pull out the locations that have the most sources, in order.

The starting place is a file like this, output from a psql SELECT query a while back. Could do a new one, and could do much of this in psql, but who wants to stay in that syntax when things start to evolve?

 5119|wwmt|42.63222,-85.53778|{"42.63222,-85.53778"}
5120|thedenverchannel|39.730722,-105.232111|{"39.730722,-105.232111"}
5121|nbc4i|39.970972,-83.027556|{"39.970972,-83.027556"}
5103|ktvz|44.077611,-121.331417|{"44.077611,-121.331417"}
5122|kimt|43.47556,-92.70833|{"43.47556,-92.70833"}
5123|kgun9|32.415444,-110.714750|{"32.415444,-110.714750"}
5124|wcyb|36.44944,-82.10806|{"36.44944,-82.10806"}
5125|ktxs|32.413444,-100.107306|{"32.413444,-100.107306"}
5126|foxla|34.22472,-118.06306|{"34.22472,-118.06306"}
5127|q13fox|47.54778,-122.80750|{"47.54778,-122.80750"}
5128|krem|47.59472,-117.29917|{"47.59472,-117.29917"}
5129|centralillinoisproud|40.63500,-89.53861|{"40.63500,-89.53861"}
5130|ksby|35.36028,-120.65611|{"35.36028,-120.65611"}
5133|news8000|44.09111,-91.33806|{"44.09111,-91.33806"}

All we need to do is pull out that third field, round it off, count them, and sort. That file is all.geo.list.

Something like awk -F '|' '{split($3, geo, /,/); mycount[round(geo[1]), round(geo[2])]++} END { for (i in mycount) print mycount[i],i }' all.geo.list | sort -nr | less .

That’s an awk script that does the work on each line, line-by-line, then prints out the counts when done, then pipes to sort and uses less to view the output nicely.

The practical question is how much to round. round() is not a built-in awk function, so we get to provide it however we want. Integers are probably ok, but note that the second geo value is negative in these entries.

func round(x) { return int(.5+x) }

is the usual function, but perhaps we want to abs and sgn in there, so -2.8 becomes -3, not -2 (do the math!).

func abs(x) { if (x < 0) return -x; return x }
func sgn(x) { if (x < 0) return -1; return 1 }
func round(x) { return sgn(x) * int(.5 + abs(x)) }

we could quibble over sgn(0), but hey, throwaway script. One should have some idea where to locate oneself on the “correctness”/”wasted time and effort”/”increased complexity” curve (manifold). One can’t always be right, but being a precision for its own sake is often wrong. Always good to have discipline but many cs people have an impractical sense of perfection; one should have the discipline to be practical when it matters.

So now the command is not hard to put on a single line, but it does test one’s human parsing patience:

awk -F '|' '{split($3, geo, /,/); mycount[round(geo[1]), round(geo[2])]++} END { for (i in mycount) print mycount[i],i } func round(x) { return int(.5+x) }' all.geo.list | sort -nr | less

without the sign management, and with:

awk -F '|' '{split($3, geo, /,/); mycount[round(geo[1]), round(geo[2])]++} END { for (i in mycount) print mycount[i],i } func round(x) { return sgn(x) * int(.5 + abs(x)) } func abs(x) { if (x < 0) return -x; return x }
func sgn(x) { if (x < 0) return -1; return 1 }' all.geo.list | sort -nr | less

Apparently there are some self-designated geniuses and gurus on unix boards who think this is good practice. Well, to be fair, other languages have basic integer functions built-in and don’t need those functions. On the other hand, very nice choosing short strings for identifiers in throwaway scripting, because the namespace is not filled with other people’s functions.

So as a file it looks like this, temp.awk,

BEGIN { FS = "|" }
{
  split($3, geo, /,/) 
  mycount[round(geo[1]), round(geo[2])]++
} 
END { 
  for (i in mycount) print mycount[i],i 
}
func round(x) { return sgn(x) * int(.5 + abs(x)) } 
func abs(x) { if (x < 0) return -x; return x }
func sgn(x) { if (x < 0) return -1; return 1 }

and it’s invoked as awk -f temp.awk all.geo.list | sort -nr | list which is a lot nicer. We did have to change the ‘|’ separator to “|” because command line best practices don’t necessarily make good syntax within the program block.

I’m going to do it again. I’m going to move the filename explicitly into the middle pattern, so people can see what the shorthand is doing with pattern/action. And I’m going to move the output form into the script as well. Not because I think it’s necessary for a throwaway script, i.e., very probably throwaway, but because I think that in-script output control is a good habit to have when moving to more complicated things. And I’ve successfully taught this for decades to thousands. Yeah. Thousands.

BEGIN { 
  FS = "|"
  while (getline < "all.geo.list") {
    split($3, geo, /,/) 
    mycount[round(geo[1]), round(geo[2])]++
  }
  for (i in mycount) print mycount[i],i | "sort -nr | less"
}
func round(x) { return sgn(x) * int(.5 + abs(x)) } 
func abs(x) { if (x < 0) return -x; return x }
func sgn(x) { if (x < 0) return -1; return 1 

Note that I can put the output on a command pipe, and that command pipe can contain a pipe (sort, piped to less). Probably something one would not try in one’s first rodeo. I did have to move the END block into the BEGIN block because awk doesn’t like a missing middle pattern/action set.

Now the command line is just awk -f temp.awk .

Two more things. That getline returns -1 when the file is missing (or can’t be read). So we often get in the habit of writing

  while (getline < "all.geo.list" > 0) { ... }

Yes, this looks like <"all.geo.list"> in perl which is itself the way to read from a file! So confusing moving between languages. But it’s really

  while ( (getline < fname) > 0 ) { ... }

Change the file name so it has a typo, say all.geos.list, and the first version will spin on the while(-1) returned value, while the latter version will test >0, and exit on while(0) . This is an extremely important habit to adopt when writing cgi programs by the way.

The second change is prettying up the output. myarray[foo, bar] uses a SUBSEP character between foo and bar. This makes the strings less fun to read. So the increment could be

    mycount[round(geo[1]) "," round(geo[2])]++

which makes the output look like

48 34,-118
28 41,-74
25 38,-122

instead of

 48 34^-118
 28 41^-74
 25 38^-122

And we can actually think about what that means from a data point of view since we didn’t have to work so hard writing the data-inspection program.