An Unexpected Character Replacement

For a full list of BASHing data blog posts see the index page.

As a data auditor I’m used to seeing non-ASCII characters appearing as replacement characters, question marks and mojibake.

A few weeks ago I found a replacement in GBIF that I’d never seen before: M<fc>ller. It was a hexadecimal value for the character “ü” enclosed in angle brackets. That particular hex value for “ü” appears in Windows-1252 and other encodings, but what program did this replacement? And why?

Suspecting the worst, I did a search for other angle-bracket-enclosed strings in the dataset. The search turned up a lot of data items which had originally contained a non-breaking space, and which now contained that character’s Unicode representation in brackets, for example Laevicardium<U+00A0>. Excluding these, the result is shown here:
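A search of this kind can be sketched in Python. This is an illustration only, not the command used in the audit: the regular expression, the function name find_bracket_replacements and the sample strings are all assumptions.

```python
import re

# Assumed forms of the replacements: <xx> (two hex digits) for a byte,
# and <U+XXXX> (four to six hex digits) for a Unicode code point.
BRACKET_RE = re.compile(r"<(?:[0-9A-Fa-f]{2}|U\+[0-9A-Fa-f]{4,6})>")


def find_bracket_replacements(text):
    """Return every angle-bracket replacement found in text, in order."""
    return BRACKET_RE.findall(text)


# Hypothetical sample values; the last one holds ordinary markup, not a replacement.
samples = ["M<fc>ller", "Laevicardium<U+00A0>sp.", "Smith <i>et al.</i>"]
for sample in samples:
    print(sample, find_bracket_replacements(sample))
```

Restricting the pattern to hex digits keeps ordinary markup such as <i> out of the results, though a two-letter tag made only of hex letters would still match.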

The characters replaced by hexadecimal values all seemed to be in Windows-1252 encoding:

The Unicode replacements are a bit less obvious, as both are control characters. “U+0092” is PU2 (private use 2) and “U+009A” is SCI (single character introducer). The first one appears in places where you would expect a single quote, and a right single quote has the Windows-1252 encoding “92” in hexadecimal. So why didn’t that appear as “<92>”?

The same happened with the other Unicode replacement, “U+009A”. The original was “Duriš” (it should actually have been “Ďuriš”, but that’s another issue), and that final “s” with a caron is encoded in Windows-1252 with the hex value “9a”.
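The two readings of these byte values can be compared side by side in Python. The sketch below only shows the mapping. It does not establish which program made the replacements, and the helper name readings is an assumption.

```python
import unicodedata


def readings(byte):
    """Return (character under Windows-1252, code point if the byte value is reused as-is)."""
    raw = bytes([byte])
    as_cp1252 = raw.decode("cp1252")   # the character the data providers intended
    as_latin1 = raw.decode("latin-1")  # byte value taken directly as a code point
    return as_cp1252, f"U+{ord(as_latin1):04X}"


for byte in (0x92, 0x9A):
    char, code_point = readings(byte)
    # Category "Cc" marks a control character.
    category = unicodedata.category(chr(byte))
    print(f"{byte:02x}: Windows-1252 {char!r}, as code point {code_point} ({category})")
```

Read as Windows-1252, the bytes give the right single quote and “š”. Taken directly as code points, the same values land on the control characters U+0092 and U+009A.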

Baffled, I went to the website of the organization in the USA that managed the dataset sent to GBIF. There was an Excel version of the data available for download, and when I opened it all the original non-ASCII characters were present and none had been replaced.

I then contacted the US scientists who compiled the data and sent it to GBIF. The dataset started out as a Microsoft Excel file, presumably in Windows-1252 encoding. This was converted into a CSV, then loaded into the R environment for adding additional information required by GBIF, then exported from R as a text file with the command option fileEncoding="UTF-8".

The culprit seems to be R. For example, the text-cleaning function replace_non_ascii will generate output with hexadecimal values within angle brackets. I’m not sure whether that tool or some other R “text cleaner” was also doing the incorrect Unicode replacements.
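To see how a hexadecimal replacement of this kind could arise, here is a Python imitation of the substitution. It is not R’s code or the replace_non_ascii implementation. The function name hex_escape_non_ascii is hypothetical, and Windows-1252 is assumed as the byte encoding.

```python
def hex_escape_non_ascii(text, encoding="cp1252"):
    """Replace each non-ASCII byte of text (in the given encoding) with <xx>."""
    out = []
    for byte in text.encode(encoding):
        # ASCII bytes pass through; anything else becomes its hex value in brackets.
        out.append(chr(byte) if byte < 0x80 else f"<{byte:02x}>")
    return "".join(out)


print(hex_escape_non_ascii("Müller"))  # M<fc>ller
print(hex_escape_non_ascii("Duriš"))   # Duri<9a>
```

Under this assumption “Duriš” would come out as Duri<9a>, which is the form the post expected to find instead of the U+009A replacement.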

In the meantime, I’ve added another data check to my auditing routine: look for weird character replacements in angle brackets.

Last update: 2019-10-18
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License