Update, two hours later: we have a solution! And it’s pretty disgusting. Read on below.
Two posts in a row about the deep technical guts of something I’m working on. Well I guess this is a digital humanities blog.
Yesterday I got a wonderful present in my email – a MySQL dump of a database full of all sorts of historical goodness. The site that it powers displays snippets of relevant primary sources in their original language, including things like Arabic and Greek. Since the site has been around for rather longer than MySQL has had any Unicode support to speak of, it is not all that surprising that these snippets of text in their original language are rather badly mis-encoded.
Not too much of a problem, I naïvely thought to myself. I’ll just fix the encoding knowing what it’s supposed to have been.
A typical example looks like this. The Greek displayed on the site is:
μηνὶ Νοἐμβρίω εἰς τὰς κ ´ ινδικτιῶνος ε ´ ἔτους ,ς
but what I get from the database dump is:
μηνὶ Îοá¼Î¼Î²Ïίω εἰς Ï„á½°Ï‚ κ ´ ινδικτιῶνος ε ´ ἔτους ,Ï‚
Well, I recognise that kind of garbage, I thought to myself. It’s double-encoded UTF-8. So all I ought to need to do is to undo the spurious re-encoding and save the result. Right?
Sadly, it’s not that easy, and here is where I hope I can get comments from some DB/encoding wizards out there because I would really like to understand what’s going on.
It starts easily enough in this case – the first letter is μ. In Unicode, that is character 3BC (in hexadecimal). When you convert this to UTF-8, you get two bytes: CE BC. Unicode character CE is indeed Î, and Unicode character BC is indeed ¼. As I suspected, each of the UTF-8 bytes that make up μ has been treated as a character in its own right, and further encoded to UTF-8, so that μ has become Î¼. That isn’t hard to undo.
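The double encoding of μ can be reproduced in a few lines. This is a minimal sketch in Python rather than the Perl used for the script later in this post. The helper name `double_encode` is hypothetical, and the sketch assumes the spurious step reads each byte as the code point of the same value, which is what Latin-1 decoding does.

```python
def double_encode(text: str) -> str:
    """Mis-read the UTF-8 bytes of `text` as one character per byte."""
    utf8_bytes = text.encode("utf-8")
    # Latin-1 maps byte 0xNN to code point U+00NN, which mimics the mistake
    return utf8_bytes.decode("latin-1")


mu = "\u03bc"  # μ
print(mu.encode("utf-8").hex(" "))       # the two UTF-8 bytes of μ
garbled = double_encode(mu)
print(garbled)                           # the two characters seen in the dump
print(garbled.encode("utf-8").hex(" "))  # the bytes after the second encoding

# Undoing it is the same two steps in reverse
restored = garbled.encode("latin-1").decode("utf-8")
print(restored)
```

The last two lines are the "undo" that works for μ: encode back to single bytes, then decode those bytes as the UTF-8 they always were.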
But then we get along to that ω further down the line, which has become Ï‰. That is Unicode character 3C9, which in UTF-8 becomes CF 89. Unicode CF is the character Ï as we expect, but Unicode 89 is not a printable character at all (it is one of the C1 control codes). Now it is perfectly possible to render 89 as UTF-8 (it would become C2 89) but instead I’m getting a rather inexplicable character whose Unicode value is 2030 (UTF-8 E2 80 B0)! And here the system starts to break down – I cannot figure out what possible mathematical transformation has taken place to make 89 become 2030.
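To make the mismatch concrete, here is a small Python sketch comparing the bytes one would expect for code point 89 with the bytes that actually appear. Nothing in it is specific to MySQL; it only uses Python's built-in UTF-8 codec, and the variable names are illustrative.

```python
omega = "\u03c9"  # ω
omega_utf8 = omega.encode("utf-8")
print(omega_utf8.hex(" "))            # the UTF-8 bytes of ω

# Second byte read naively as code point U+0089, then encoded again
expected = chr(0x89).encode("utf-8")
# What the dump contains instead: code point U+2030
observed = "\u2030".encode("utf-8")

print(expected.hex(" "))
print(observed.hex(" "))
```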
There seems to be little mathematical pattern to the results I’m getting, either. From the bad characters in this sample:
ρ -> 3C1 -> CF 81 --> CF 81 (correct!!)
ς -> 3C2 -> CF 82 --> CF 201A
τ -> 3C4 -> CF 84 --> CF 201E
υ -> 3C5 -> CF 85 --> CF 2026
ω -> 3C9 -> CF 89 --> CF 2030
Ideas? Comments? Do you know MySQL like the back of your hand and have you spotted immediately what’s going on here? I’d love to crack this mystery.
After this post went live, someone observed to me that the ‘per mille’ sign, i.e. that double-percent thing at Unicode value 2030, has the value 89 in…Windows CP-1250! And, perhaps more relevantly, Windows CP-1252. (In character encodings just as in almost everything else, Windows always liked to have their own standards that are different from the ISO standards. Pre-Unicode, most Western European characters were represented in an eight-bit encoding called ISO Latin 1 everywhere except Windows*, where they used this CP-1252 instead. For Eastern Europe, it was ISO Latin 2 / CP-1250.)
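This is easy to check with any tool that ships these code pages. A quick sketch using Python's built-in `cp1250`, `cp1252` and `latin-1` codecs (the variable names are illustrative):

```python
per_mille = "\u2030"  # ‰

# The per mille sign is byte 89 in both Windows code pages
for codepage in ("cp1250", "cp1252"):
    print(codepage, per_mille.encode(codepage).hex())

# ISO Latin 1 reads that same byte as the control code U+0089,
# while CP-1252 reads it as U+2030
as_latin1 = b"\x89".decode("latin-1")
as_cp1252 = b"\x89".decode("cp1252")
print(hex(ord(as_latin1)), hex(ord(as_cp1252)))
```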
So what we have here is: MySQL is interpreting its character data as Unicode, and expressing it as UTF-8, as we requested. Only then it hits a Unicode value like 89 which is not actually a printable character at all. But instead of passing it through and letting us deal with it, MySQL says “hm, they must have meant the Latin 1 value here. Only when I say Latin 1 I really mean CP-1252. So I’ll just take this value (89 in our example), see that it is the ‘per mille’ sign in CP-1252, and substitute the correct Unicode for ‘per mille’. That will make the user happy!” This also explains why ρ came through intact: 81 is one of the few values that CP-1252 leaves unassigned, so there was nothing to substitute.
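The whole transformation can be simulated. The sketch below is in Python, and `mysql_latin1_decode` is a hypothetical helper, not a MySQL function. It assumes each byte is read as CP-1252 and, where CP-1252 leaves the byte undefined, kept as the code point of the same value. Under that assumption it reproduces the table of bad characters from earlier in the post.

```python
def mysql_latin1_decode(raw: bytes) -> str:
    """Read bytes as CP-1252, passing undefined bytes through unchanged.

    Assumption: this mimics how the dump was garbled; it is not MySQL code.
    """
    chars = []
    for value in raw:
        try:
            chars.append(bytes([value]).decode("cp1252"))
        except UnicodeDecodeError:
            # 81, 8D, 8F, 90 and 9D have no character assigned in CP-1252
            chars.append(chr(value))
    return "".join(chars)


for letter in "ρςτυω":
    garbled = mysql_latin1_decode(letter.encode("utf-8"))
    # Print the code points of the garbled characters in hexadecimal
    print(letter, " ".join(f"{ord(c):X}" for c in garbled))
```

Only ρ keeps its second value unchanged, because 81 falls into the `except` branch.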
Hint: It really, really, doesn’t make the user happy.
So here is the Perl script that will take the garbage I got and turn it back into Greek. Maybe it will be useful to someone else someday!
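#!/usr/bin/env perl
use strict;
use warnings;
use Encode;
use Encode::Byte;    # provides the cp1252 encoding

while(<>) {
    # Read each input line as UTF-8 to get the double-encoded characters
    my $line = decode_utf8( $_ );
    my @chr;
    foreach my $c ( map { ord( $_ ) } split( '', $line ) ) {
        # Anything above 255 is a CP-1252 substitution; map it back to its byte
        if( $c > 255 ) {
            $c = ord( encode( 'cp1252', chr( $c ) ) );
        }
        push( @chr, $c );
    }
    # Every value is now below 256, so printing emits the original UTF-8 bytes
    my $newline = join( '', map { chr( $_ ) } @chr );
    print $newline;
}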
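The same repair can be sketched in Python. This is an illustrative equivalent rather than the script that was run on the dump. The name `repair` is hypothetical, and the sketch assumes every character in the input is either below 256 or has a CP-1252 byte; anything else raises `UnicodeEncodeError` instead of being silently replaced.

```python
def repair(garbled: str) -> str:
    """Turn double-encoded text back into the text it was meant to be."""
    raw = bytearray()
    for ch in garbled:
        code = ord(ch)
        if code > 255:
            # Map e.g. U+2030 back to the CP-1252 byte 89 it came from
            code = ch.encode("cp1252")[0]
        raw.append(code)
    # The recovered bytes are the original UTF-8
    return raw.decode("utf-8")


print(repair("\u00cf\u2030"))  # garbled ω
print(repair("\u00cf\u0081"))  # garbled ρ, whose 81 passed through unchanged
print(repair("\u00ce\u00bc"))  # garbled μ
```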
[*] Also, as I realized after posting this, except Mac, which used MacRoman. Standards are great! Let’s all have our own!