
Daily Data Cleaning Notes: Awk FTW 5

June 26th 2019

A lot of code out there can de-URI input from the web that comes in from GET or POST. But those routines aren’t always the most readable, customizable, or extensible. Here’s how we can roll our own in about 15 minutes of awk.

First, it’s nice to find a good authoritative source in a useful format without too much markup:

https://www.w3schools.com/tags/ref_urlencode.asp

Fits the bill. Pop into the source and the useful lines look like this:

<tr>
  <td>$</td>
  <td>%24</td>
  <td>%24</td>
</tr>

Actually, in the real-world development of this, I did a copy-and-paste from the browser-displayed table and avoided the HTML, but let’s go ahead with the source here.

lynx -source 'https://www.w3schools.com/tags/ref_urlencode.asp' | grep '<td>'| less

shows that we are very lucky — no irregularity of <td> tags, so every three lines is a group for us.
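If it’s like the live page, the first few groups come out looking something like this (note the page spells out "space" for the space character, which will come back to bite us later):

  <td>space</td>
  <td>%20</td>
  <td>%20</td>
  <td>!</td>
  <td>%21</td>
  <td>%21</td>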

lynx -source 'https://www.w3schools.com/tags/ref_urlencode.asp' | grep '<td>' |sed 's/<td>//'| sed 's/<\/td>//' | sed 's/^ *//' | less

This will shed the <td> and </td> tags, and the leading space if it annoys you. Awk won’t care about the leading space in the next step, but hey. So each group of three contains the literal character, the Windows-1252-style %-encoding, and the UTF-8 %-encoding. We want to generate something like

gsub(/%24/,"$",x)

for each group. Problem is that there are TWO %-encodings for each target character. For example, superscript 3 could come in as %B3 (the Windows-1252 form) or %C2%B3 (the UTF-8 form).

By the way, we’ll call the first a 1-byte URI-encoding and the second a 2-byte encoding, because these WERE 1- and 2-byte characters before they got URI-encoded.

And we’ll want

gsub(/%B3/, "³", x)
gsub(/%C2%B3/, "³", x)

and, by the way, we’ll want to do the 2-byte gsub before the 1-byte gsub, or the %C2 will be orphaned after the removal of the %B3! We’ll worry about the ordering after we figure out how to generate them.
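A quick sanity check at the shell makes the orphaning concrete (a throwaway one-liner; assumes a UTF-8 terminal):

echo 'x=%C2%B3' | awk '{ gsub(/%B3/, "³"); print }'

prints x=%C2³, with the %C2 left stranded. For reference, here is the raw group for superscript 3 from the table: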

  <td>³</td>
  <td>%B3</td>
  <td>%C2%B3</td>

Now, one way to do this is to generate a line for each group of three:

³ %B3 %C2%B3

where there are three fields. Then read this into the script at run-time from a file, e.g.

BEGIN {
  # load the three-field lines: both %-encodings map to the literal character
  while ((getline < "decode.uri.list") > 0) {
    sto[$2] = $1; sto[$3] = $1
  }
}
function decode(x, i) {
  for (i in sto) gsub(i, sto[i], x)
  return x
}

Looks quite clean!
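To actually use it, a sketch of a main rule (decode.awk and encoded.txt are hypothetical names; decode.uri.list has to sit in the working directory):

# decode.awk: the BEGIN block and decode() above, plus one main rule
{ print decode($0) }

and run it as

awk -f decode.awk encoded.txt > decoded.txt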

But two things here. It’s a little more trouble to address the ordering in this form (one could store the ordering, not just the association, and sort the file; or one could make three passes through the sto list, as sketched below). More importantly, it’s nice to have self-contained scripts that don’t do unnecessary i/o at run-time just to load a dictionary, especially if it’s not too hard to include the dictionary as a section of code, i.e. a bunch of gsubs.
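For what it’s worth, the three-pass idea would be a small change to decode(): key on the length of the stored pattern, so the longer targets always go first. A sketch:

function decode(x, i) {
  # longest targets first: 3-byte encodings are 9 chars (%E2%80%93),
  # 2-byte are 6 (%C2%B3), 1-byte are 3 (%B3)
  for (i in sto) if (length(i) == 9) gsub(i, sto[i], x)
  for (i in sto) if (length(i) == 6) gsub(i, sto[i], x)
  for (i in sto) if (length(i) == 3) gsub(i, sto[i], x)
  return x
}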

So continuing with the desire to generate a lot of gsubs, we might do:

lynx -source 'https://www.w3schools.com/tags/ref_urlencode.asp' | grep '<td>'| sed 's/<td>//'| sed 's/<\/td>//' | sed 's/^ *//' | awk '{ c=$0; getline p1; getline p2; print "gsub(" p1 "," c ")";  print "gsub(" p2 "," c ")" }' | less

The key awk idea here is that for each line, we store the character as c, then getline the next TWO LINES into p1 and p2: the patterns we will transform into c. One could argue that the command line is already getting pretty hairy here, and popping into a full-blown awk script, or the editor, would be cleaner. In the real development, yes, it was all done in the editor. But adding a command to a series of commands, one at a time, is actually quite easy and safe. The real limit is the depth and complexity of all the quoting and backslashing that naturally accumulates at the command line.
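For the record, that same awk program as a stand-alone script reads a little easier (gen.gsubs.awk is a hypothetical name; feed it the cleaned-up three-line groups):

# gen.gsubs.awk: emit two gsub lines per group of three input lines
{
  c = $0        # the literal character
  getline p1    # its windows-1252-style encoding
  getline p2    # its utf-8 encoding
  print "gsub(" p1 "," c ")"
  print "gsub(" p2 "," c ")"
}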

At this point, things are looking a bit heavy, and we don’t want to keep hitting the web for the data while we’re doing script-dev, so let’s put this data in an intermediate file:

lynx -source 'https://www.w3schools.com/tags/ref_urlencode.asp' | grep '<td>'| sed 's/<td>//'| sed 's/<\/td>//' | sed 's/^ *//' | awk '{ c=$0; getline p1; getline p2; print "gsub(" p1 "," c ")";  print "gsub(" p2 "," c ")" }' > temp1

temp1 looks like this:

gsub(%20^M,space^M)
gsub(%20^M,space^M)
gsub(%21^M,!^M)
gsub(%21^M,!^M)
gsub(%22^M,"^M)
gsub(%22^M,"^M)

So there is still much work to be done. The first arguments must be enclosed in slashes, the second arguments in double quotes, and some special cases will be trouble: ", &, space, <, etc. And those horrible ctrl-M’s need to be removed. These can all be done better in the editor, but we can do some of it at the command line if we choose to be stubborn.
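For instance, the ctrl-M’s are carriage returns from the DOS-style line endings on the page, so one stubborn option is to delete them in the pipeline itself, before they ever reach temp1:

lynx -source 'https://www.w3schools.com/tags/ref_urlencode.asp' | tr -d '\r' | grep '<td>' | ...

where tr -d '\r' throws away every carriage return on the way through.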

In vi, I have done

:g/./ s/^M//g

where I had to type a ctrl-v first to get the literal ctrl-m, an old vi thing. Sorry about that (it’s a special case for a special character, for sure), but it’s not so bad. Then

:g/./ s/gsub./ gsub(\/

:g/./ s/,/\/,"

:g/./ s/.$/", x)
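Tracing the %20 line through those three substitutions, one at a time (ignoring the leading space the first one adds for indentation):

gsub(%20,space)           after the ^M removal
gsub(/%20,space)          after the first substitution
gsub(/%20/,"space)        after the second
gsub(/%20/,"space", x)    after the third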

And this makes the lines look like:

gsub(/%20/,"space", x)
gsub(/%20/,"space", x)
gsub(/%21/,"!", x)
gsub(/%21/,"!", x)
gsub(/%22/,""", x)
gsub(/%22/,""", x)

Which is getting closer.

Now let’s not fuss over the hand edits that remain: """ has to become "\"", "&" has to become "\\&" (a bare & in a gsub replacement stands for the matched text), and "space" has to become " ". These are some annoying weird cases, and the point of doing this as a list is to make the programmer think about them. You always want to be careful letting user-generated input stick special characters in your data; after all, that’s one reason why so much data is URI-encoded in the first place!
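After those hand edits, the three troublesome lines end up as legal awk:

gsub(/%20/," ", x)
gsub(/%22/,"\"", x)
gsub(/%26/,"\\&", x)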

But getting back to the ordering, it’s quite easy at this point, even if it takes a bit of trial-and-error:

grep '%..%..%..*,' temp1 | sort -u > temp2
grep '%..%..*,' temp1 | grep -v '%..%..%..*,' | sort -u >> tem2

The first command puts the 3-byte left-hand-sides on temp2, without repeats; there are no repeats among the 3-byte characters, but there will be among the 2-byte and 1-byte ones! Each %-escape is a % plus two hex digits, so %.. matches one encoding, and the trailing , makes sure the match sits entirely within the target of the gsub, not in the replacement string, which could itself be a %! The second command appends the 2-byte left-hand-sides: it matches 2-or-more encodings, then removes the 3-or-more lines with grep -v, a most useful grep option, and appends with >>. The third command appends the 1-byte left-hand-sides the same way.

In this way, we enforce the ordering of gsubs so that those with short substitution targets do not prematurely eat substrings of those with longer substitution targets.

Actually, by doing this at the command line as we have done, the trial-and-error process of getting the grep patterns and the grep -v filters right is quite easy. Make a mistake? Start again with the first command, and use the up arrow to retrieve and fix the command where you went wrong.
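A quick way to convince yourself the ordering came out right (a sketch; with the comma as field separator, the length of $1 depends only on how many %-escapes are in the gsub target):

awk -F, 'NR > 1 && length($1) > prev { print "out of order at line " NR } { prev = length($1) }' temp2

Silence means the targets never get longer as you go down the file.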

One might not have thought of the importance of the ordering at first, but sadly, this is exactly the kind of thing that makes data cleaning so tricky.

In fact, one could have dealt with the ordering earlier in our process; in retrospect, dealing with it before introducing the gsub syntax would have been smarter. But often one doesn’t think of the optimal solution up front and has to recover, so we’ll leave it the way it is. In the real-world development of this function, the sort order was fixed in the editor, just FYI.

The final awk would look like this:

function deuri(x) {
  gsub(/%E2%80%93/,"–", x)
  gsub(/%E2%80%94/,"—", x)
  gsub(/%E2%80%98/,"‘", x)
  gsub(/%E2%80%99/,"’", x)
  gsub(/%E2%80%9A/,"‚", x)
...
  gsub(/%C3%AA/,"ê", x)
  gsub(/%C3%AB/,"ë", x)
  gsub(/%C3%AC/,"ì", x)
  gsub(/%C3%AD/,"í", x)
  gsub(/%C3%AE/,"î", x)
...
  gsub(/%FC/,"ü", x)
  gsub(/%FD/,"ý", x)
  gsub(/%FE/,"þ", x)
  gsub(/%FF/,"ÿ", x)
  return x
}
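As a quick smoke test (deuri.awk is a hypothetical file holding the function plus the one-line main rule { print deuri($0) }):

echo 'Caf%C3%A9%20M%C3%BCller' | awk -f deuri.awk

should print Café Müller.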

Is this function really 381+2 lines long? Sure. Isn’t it possible to do this with a fancy call to hexadecimal-aware code in about 10 lines, well, maybe 25 with the longest-first ordering enforced? Sure. But that’s exactly the code we’re replacing because that code (which came from some online unix wizard on a help board, actually) didn’t give us sufficient control.
