Daily Data Cleaning Notes: Awk FTW 1

Posted
June 10th 2019

This is a new series of posts aimed at unix newbies or wannabes (we should all wanna be good unix data cleaners). Data cleaning is a perennial headache for data analysts and machine learning people, and a constant job when working with data of any size: big data, not-quite-so-big data, or otherwise.

We’re not interested in arguing better ways of doing things (ahem, comp.lang.awk people who prefer cryptic solutions over easy prototypes and variations). We’re just helping out the people who might find this page.

Of course, if the people who find the tech interesting also find the civicfeed APIs interesting, great! Sign your company up for the free tier, and if you need more, we can always talk.

Today’s task was to add frequent short words appearing in recent extracted phrases to a new, manually curated list. So the first step is to examine the data, which is in a file like this:

 veggie burgers:9927428 10555279 10767759 10624204 10727564 10617964 10773584 10583085 10738073 10735208
allow visitors:10579220 10560538 10545406 10577967 10455055 10693934 10413629 10512040 10488887 10748144
minister yanis varoufakis:10645667 10378236 10748292 10664133 10180502 10644678 10656481 10645404 10699818 10675541

and then run an awk command to pull out the first field, delimited by the colon:

awk -F ':' '{ print $1 }' alltermstofiles.list > bareterms1

And the result looks like this:

 sufficient infrastructure
vital issue
wants and needs
kerby jean-raymond
lynching case

Now, to be honest, I did the rest in vi, but one should try to do even the easy steps at the command line. Why? So one can simply grab the command history and turn it into a script for future automation. Also, vi commands and intermediate stages are harder to show to readers.
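For instance, one sketch of that in bash is to dump the last few history entries, numbers stripped, into a starter script (cleanup.sh is just a placeholder name):

fc -ln -10 > cleanup.sh    # -l list, -n no numbers, -10 means the last ten commands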

So this will be a bit more pedantic than real life.

Sed comes in handy to get one word on each line, even though one could go straight to awk counting of words from here. Again, we’re not interested in the most clever way to do things, because frankly, a lot of data cleaning is about data integrity, and taking small steps, viewing the result at each one, is how one keeps an eye on it.
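(For the curious, the go-straight-to-awk shortcut would look something like this sketch, which splits each line into words inside awk itself. The counting idiom it uses is explained a couple of steps down, and we’re deliberately not taking this route.)

awk '{ for (i = 1; i <= NF; i++) c[$i]++ } END { for (w in c) print c[w], w }' bareterms1 | less

Back to the small steps: first, one word per line.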

sed 's/ /\n/g' bareterms1 | less

 wants
and
needs
kerby
jean-raymond
lynching
case

So we just extend the pipe with a common awk riff that does word counting. Here, each line (a “bare term”) is used as the index into c, my counting array, and the value c[$0] at that index is incremented with ++. Yes, associative arrays; yes, auto-initialization to 0 on first mention.

sed 's/ /\n/g' bareterms1 | awk '{ c[$0]++ } END { for (i in c) print c[i],i }' | less

 2 reliving
4 complexes
3 relaunch
3 ill-fated
1 waynesville
3 hippy
1 cugat
1 self-purification

This looks right, so I add a short-word restriction. I could add it at the end, in the END block (a sketch of that variant follows the output below), but why let the associative array grow? This kind of data always makes one think about running out of real memory. The change is the insertion of “length($0)<5”, which restricts the counter-increment action to those lines where the term is short. There are plenty of ways to rewrite this, but this is about as clear as it gets.

sed 's/ /\n/g' bareterms1 | awk 'length($0)<5 { c[$0]++ } END { for (i in c) print c[i],i }' |less

 1 cdfw
2 bita
1 plo
1 nata
1 hala
2 bach
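As promised above, here is a sketch of the filter-at-the-end variant: count every term, prune only when printing, at the cost of letting the array grow.

sed 's/ /\n/g' bareterms1 | awk '{ c[$0]++ } END { for (i in c) if (length(i) < 5) print c[i], i }' | less

Back to the main pipeline.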

Add a numeric, reverse-order sort at the end:

sed 's/ /\n/g' bareterms1 | awk 'length($0)<5 { c[$0]++ } END { for (i in c) print c[i],i }' | sort -nr | less

 1797 time
1305 re
1016 year
879 food
835 plan
827 good
773 bill
757 tax
740 war
714 bank

and redirect it to a file:

sed 's/ /\n/g' bareterms1 | awk 'length($0)<5 { c[$0]++ } END { for (i in c) print c[i],i }' | sort -nr > bareterms2

It’s always good to hit tab-completion on that filename after typing a prefix, say “bareterm”, to make sure we don’t clobber some existing file, and to stay consistent with any existing naming conventions.
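One belt-and-suspenders sketch, assuming bash: list what already matches the prefix before writing, or tell the shell to refuse to overwrite files at all.

ls bareterm*
set -o noclobber    # from now on, > refuses to clobber an existing file (force with >|)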

That’s it for the auto-processing part. Next step is to eyeball bareterms2 and do the requisite manual work on it.

It’s always a good idea, by the way, to use your browser when looking at long lists. Why? Because the editor and terminal views of files just don’t give a big enough picture, and one needs to look at the data a lot. In this case, my browser doesn’t render unicode characters without a MIME type charset directive, but that’s okay. I get to use ctrl-0, ctrl-minus, and page-down to visualize patterns and problems, insofar as fixed-width ascii permits. One could also probe the data at the command line with various awk one-liners, or just grep (a couple of examples below), but this is as far as we go today.
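For the record, a few illustrative probes on bareterms2, none of them required here, just the kind of sanity checks one might run:

wc -l bareterms2                        # how many distinct short words survived
awk 'NF != 2' bareterms2 | head         # any line that is not exactly "count word"
LC_ALL=C grep -c '[^ -~]' bareterms2    # how many lines contain non-ascii bytes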