Saturday, 8 October 2011

Perl, MySQL and character sets

If you've never had to work with a combination of these three things, consider yourself lucky, because this can be really frustrating stuff.

When data enters a Perl program, the data should be decoded to Perl's internal format (encoding/character set). When data leaves Perl, the data should be encoded to the encoding that the external system expects. So when I am using the DBI module and its MySQL driver, I expect that those modules will handle the necessary character set conversions since they are closer to the external system (the database) than my code is. The data I send to DBI should be in Perl's internal format, DBI's MySQL driver should check which encoding the database uses to store the data, and the driver should convert the data to that character set. Similarly the other way; the MySQL driver knows which character set the database uses and should decode the data coming from the database to Perl's internal format. If this had been the case, I wouldn't have had to think about character sets in my own code.

Sadly, though, this is not how things work. So when I tried to fix the encoding problems that arose a couple of years ago, I did some improvising. I encoded data going into the database to utf-8 and decoded the data from utf-8 when fetching it from the database. It has worked without problems, but the data has been stored in the database with the wrong encoding. The database was set up to store data as latin1, and I think the MySQL driver expected to receive latin1-encoded data as well, so utf-8 data ended up in latin1 tables. And if some of the tables were set up to hold utf-8 data, the utf-8 data would have been encoded from latin1 to utf-8, so we would have ended up with data twice encoded to utf-8 (double-encoded). In both cases the data was encoded from latin1 to utf-8 one too many times.

We had known for a while that the data in the database had the wrong encoding, but everything worked, so we didn't do anything about it - until this week.

One of the things we had to figure out was how we should encode data going into the database and how to decode it coming back. As I already said, the MySQL driver should ideally handle the encoding and decoding, but this is not an ideal world, obviously. In the documentation for DBI (which is a database API for Perl) we can read the following:
Perl supports two kinds of strings: Unicode (utf-8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding). Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used. Similarly, when fetching from the database character data that isn't iso-8859-1 the driver should convert it into utf8.
To me, this is a little bit hard to understand, but since Perl can use either latin1 (iso-8859-1) or utf-8 to store data internally, I interpret it to mean that we can use Perl's internal format, both when sending data to and fetching it from the database. That would have been good, but it doesn't work. One thing that does work, on the other hand, is to connect to the database with the mysql_enable_utf8 option and then utf-8-encode data when storing it in the database. The data we get back from the database seems to be in Perl's internal format, so the documentation seems to be correct regarding that last part.

Making these changes, we can now store data correctly in the database and get it out correctly. But it still remains to convert existing database data. We first tried the following approach:
  1. Dump the database with mysqldump.
  2. Go through the lines in the dump file and encode each line from utf-8 to latin1.
  3. Import the modified dump (replacing the original database).
This approach worked for most, but not all characters. We didn't specify the character set we wanted the dump file to be in, so perhaps the content was encoded as Windows-1252 or something, but we haven't tried to verify that. Before we came up with that theory, we found a possible solution on the web, namely to store the database file as latin1, meaning that the character set of the file actually will be utf-8, since the character set of the database was incorrect. Then we could just reimport that utf-8 file without modifications, only making sure to tell MySQL that it is utf-8-encoded. (Well, actually, we did make some changes, we changed the encoding of the tables from latin1 to utf-8 in the table definitions, since we want to be able to store more characters than just those found in the latin1 character set.)

Amazingly, it worked, at least on our development server. Unamazingly, it did not work on the test server. It is extremely annoying when things don't behave the same way on all servers. We had three different theories for the difference in behavior.
  • For some reason the data in the database (after conversion) was not in the correct format even though it looked the same on both servers.
  • Difference in configuration of MySQL, the web server or some other system on the two servers.
  • Both Perl, MySQL, DBI and DBI's MySQL driver (DBD::mysql) had different version numbers on the two servers, so upgrading some of those on the test server could potentially solve the problem.
Long story short, the solution was to upgrade DBD::mysql on the test server. It turns out that the mysql_enable_utf8 option didn't work or hadn't been implemented in older versions of DBD::mysql, and one of those older versions was installed on the test server.

The last problem was resolved around 4pm on Friday, after about four days working on the whole database character set issue, and we were actually quite close to giving up. So it was a great relief when everything finally worked as it should at the end of the week. Let's just hope there aren't any hidden bugs we haven't found yet..!