Saturday 8 October 2011

Perl, MySQL and character sets

If you've never had to work with a combination of these three things, consider yourself lucky, because this can be really frustrating stuff.

When data enters a Perl program, the data should be decoded to Perl's internal format (encoding/character set). When data leaves Perl, the data should be encoded to the encoding that the external system expects. So when I am using the DBI module and its MySQL driver, I expect that those modules will handle the necessary character set conversions since they are closer to the external system (the database) than my code is. The data I send to DBI should be in Perl's internal format, DBI's MySQL driver should check which encoding the database uses to store the data, and the driver should convert the data to that character set. Similarly the other way; the MySQL driver knows which character set the database uses and should decode the data coming from the database to Perl's internal format. If this had been the case, I wouldn't have had to think about character sets in my own code.

Sadly, though, this is not how things work. So when I tried to fix the encoding problems that arose a couple of years ago, I did some improvising: I encoded data going into the database to utf-8 and decoded it from utf-8 when fetching it back out. It has worked without problems, but the data has been stored in the database with the wrong encoding. The database was set up to store data as latin1, and I think the MySQL driver expected to receive latin1-encoded data as well, so utf-8 data ended up in latin1 tables. And if some of the tables had been set up to hold utf-8 data, the driver would have encoded our already-utf-8 data from latin1 to utf-8 once more, leaving it double-encoded. In both cases the data was encoded from latin1 to utf-8 one time too many.
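
To make it concrete, here is a rough sketch of what that improvised approach looked like (connection details, table and column names are made up for the example):

  use strict;
  use warnings;
  use utf8;
  use DBI;
  use Encode qw(encode decode);

  # Hypothetical illustration: utf-8-encode on the way into the database,
  # decode from utf-8 on the way out.
  my $dbh = DBI->connect('DBI:mysql:database=o2;host=localhost',
                         'user', 'password', { RaiseError => 1 });

  my $title = "Blåbærsyltetøy";   # a string in Perl's internal format

  # Encode before storing ...
  $dbh->do('insert into articles (title) values (?)', undef, encode('utf-8', $title));

  # ... and decode after fetching.
  my ($stored) = $dbh->selectrow_array('select title from articles order by id desc limit 1');
  $stored = decode('utf-8', $stored);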

We had known for a while that the data in the database had the wrong encoding, but everything worked, so we didn't do anything about it - until this week.

One of the things we had to figure out was how we should encode data going into the database and how to decode it coming back. As I already said, the MySQL driver should ideally handle the encoding and decoding, but this is not an ideal world, obviously. In the documentation for DBI (which is a database API for Perl) we can read the following:
Perl supports two kinds of strings: Unicode (utf-8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding). Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used. Similarly, when fetching from the database character data that isn't iso-8859-1 the driver should convert it into utf8.
To me, this is a little bit hard to understand, but since Perl can use either latin1 (iso-8859-1) or utf-8 to store data internally, I interpret it to mean that we can use Perl's internal format, both when sending data to and fetching it from the database. That would have been good, but it doesn't work. One thing that does work, on the other hand, is to connect to the database with the mysql_enable_utf8 option and then utf-8-encode data when storing it in the database. The data we get back from the database seems to be in Perl's internal format, so the documentation seems to be correct regarding that last part.
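
A sketch of the combination that works for us, with made-up connection details: mysql_enable_utf8 on the connection, plus explicit utf-8 encoding of outgoing data.

  use strict;
  use warnings;
  use utf8;
  use DBI;
  use Encode qw(encode);

  my $dbh = DBI->connect(
    'DBI:mysql:database=o2;host=localhost',
    'user', 'password',
    { RaiseError => 1, mysql_enable_utf8 => 1 },
  );

  # Outgoing data is utf-8-encoded by us ...
  my $title = "Smørbrød";
  $dbh->do('insert into articles (title) values (?)', undef, encode('utf-8', $title));

  # ... while data coming back seems to arrive in Perl's internal format,
  # so we don't decode it ourselves.
  my ($title_back) = $dbh->selectrow_array('select title from articles order by id desc limit 1');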

Making these changes, we can now store data correctly in the database and get it back out correctly. But we still had to convert the existing data in the database. We first tried the following approach:
  1. Dump the database with mysqldump.
  2. Go through the lines in the dump file and encode each line from utf-8 to latin1.
  3. Import the modified dump (replacing the original database).
This approach worked for most, but not all, characters. We didn't specify which character set we wanted the dump file to be in, so perhaps the content ended up encoded as Windows-1252 or something similar, but we haven't tried to verify that. Before we came up with that theory, we found a possible solution on the web: dump the database as latin1, which means the content of the dump file will actually be utf-8, since the character set declared for the database was incorrect. Then we could reimport that utf-8 file without modifications, only making sure to tell MySQL that it is utf-8-encoded, as sketched below. (Well, we did make one change: we switched the encoding of the tables from latin1 to utf-8 in the table definitions, since we want to be able to store more characters than those found in the latin1 character set.)
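
Roughly, the steps look like this, here sketched as a small Perl wrapper around the command-line tools (database name, credentials and file name are placeholders):

  use strict;
  use warnings;

  my $db   = 'o2';
  my $dump = 'o2-dump.sql';

  # 1. Dump the "latin1" database as latin1, so the bytes (which are really
  #    utf-8) are written out unchanged.
  system("mysqldump --default-character-set=latin1 -u user -p $db > $dump") == 0
    or die "mysqldump failed: $?";

  # 2. Change the table definitions in the dump file from latin1 to utf-8
  #    (we did this with a simple search and replace).

  # 3. Reimport the dump, telling MySQL that the file is utf-8-encoded.
  system("mysql --default-character-set=utf8 -u user -p $db < $dump") == 0
    or die "mysql import failed: $?";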

Amazingly, it worked, at least on our development server. Unamazingly, it did not work on the test server. It is extremely annoying when things don't behave the same way on all servers. We had three different theories for the difference in behavior.
  • For some reason the data in the database (after conversion) was not in the correct format even though it looked the same on both servers.
  • Differences in the configuration of MySQL, the web server or some other system on the two servers.
  • Perl, MySQL, DBI and DBI's MySQL driver (DBD::mysql) all had different versions on the two servers, so upgrading some of them on the test server could potentially solve the problem.
Long story short, the solution was to upgrade DBD::mysql on the test server. It turns out that the mysql_enable_utf8 option didn't work or hadn't been implemented in older versions of DBD::mysql, and one of those older versions was installed on the test server.

The last problem was resolved around 4pm on Friday, after about four days of working on the whole database character set issue, and we were actually quite close to giving up. So it was a great relief when everything finally worked as it should at the end of the week. Let's just hope there aren't any hidden bugs we haven't found yet!

Saturday 16 April 2011

MongoDB - usable for O2?

Most programmers are familiar with relational database management systems such as MySQL, MSSQL, PostgreSQL or Oracle. In these systems the tables we create expect their data to be in a very specific format, and an object is normally not stored in just one table; rather, its data is usually spread across several database tables. For example, in O2, a Member object uses four tables for its data: O2_OBJ_OBJECT, O2_OBJ_PERSON, O2_OBJ_MEMBER and O2_OBJ_OBJECT_VARCHAR.

MongoDB is a so-called document-oriented database management system, and it is supposed to be faster than relational database management systems and to scale better. But it doesn't have all the functionality of the relational systems; most importantly, it doesn't support joins.

MongoDB doesn't have tables or rows like "normal" databases. Instead it has something called collections and documents, but you can think of a collection as sort of a table, and of a document as a row.

When using Perl, as we do, a document is a hash ref. It has lots of keys and corresponding values. A value doesn't have to be a scalar; it can also be a reference to another hash, or a reference to an array. And MongoDB doesn't care what that structure looks like. You are free to insert data structures with totally different keys into the same collection. Whether this is a smart thing to do is another matter... We would probably store similar data structures in each separate collection ("table"). If we were to start using MongoDB in O2, one object would likely correspond to one document in the collection that represents that object's class.
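
A small sketch with the Perl driver, as the API looked when we tried it (database, collection and field names are just examples):

  use strict;
  use warnings;
  use MongoDB;

  my $connection = MongoDB::Connection->new(host => 'localhost', port => 27017);
  my $database   = $connection->get_database('o2');
  my $members    = $database->get_collection('Member');

  # A document is simply a hash ref, and values can be nested structures.
  my $member = {
    username => 'nils',
    name     => { first => 'Nils', last => 'Nilsen' },
    roles    => [ 'editor', 'admin' ],
  };
  $members->insert($member);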

Objects in O2 can be related to each other. The way to do this is to set the type of a field to a class name - the class name of objects that can be stored in that field. In the database, only the IDs of those objects are stored in the "main" object. This means that when we search for objects through related objects in a relational database, we have to join several tables in the search query, which can have a severe impact on performance, especially when we have to join more than a couple of tables, which sometimes happens in O2. MongoDB has something called DBRef, which is a reference to another object. I first thought it would be possible to search for fields in the related object through the main object and the DBRef, but it turns out that's not possible after all. And it probably wouldn't have been very efficient even if it were possible.
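
With plain ID references, getting data from the related object means a second query, since there are no joins. Continuing the hypothetical sketch above (field and collection names are made up):

  my $articles = $database->get_collection('Article');
  my $persons  = $database->get_collection('Person');

  # The relation is stored as just the ID of the related object ...
  my $article = $articles->find_one({ title => 'Some article' });

  # ... so getting hold of the related object requires another query.
  my $author = $persons->find_one({ _id => $article->{authorId} });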

I believe the main reason why O2 is sometimes slow is that searches in related objects can take a long time, both because of the many joins and because MySQL sometimes doesn't execute the joins in the most efficient order. One way to mitigate this with MongoDB could be to store whole serialized objects instead of just object IDs in the database. This, however, has two challenges: 1) duplication of data, and 2) how to update all the serialized copies when the actual object is updated.
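
The alternative would be to embed a serialized copy of the related object, so a search on its fields needs only one query. Continuing the same hypothetical sketch:

  # Store a (partial) serialized copy of the author inside the article,
  # in addition to the ID.
  my $article = {
    title    => 'Some article',
    authorId => $author->{_id},
    author   => {
      _id  => $author->{_id},
      name => $author->{name},
    },
  };
  $articles->insert($article);

  # Searching on fields of the related object no longer needs a join
  # or a second query:
  my $cursor = $articles->find({ 'author.name.last' => 'Nilsen' });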

We have to choose between fast response times and no duplication - we can't have both. At least not without (other forms of) caching. But caching makes things harder to debug, so the less caching we need, I think, the better.

When it comes to updating all the serialized copies, I think this is possible if we store, in the actual object, the IDs of the objects that contain serialized copies of it. When the object is saved, we go through all of those other objects and update them as well. This will make saving objects slower than today, but the question is by how much. If it is very slow, it might be possible to do it asynchronously, so that the user doesn't have to wait for it.
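
One possible sketch of that, where the person document keeps a list of the articles that embed it (the field name containedInIds is made up):

  # Save the person itself, then update every document that embeds a copy.
  sub save_person {
    my ($persons, $articles, $person) = @_;

    $persons->update({ _id => $person->{_id} }, $person, { upsert => 1 });

    for my $article_id (@{ $person->{containedInIds} || [] }) {
      $articles->update(
        { _id    => $article_id },
        { '$set' => { 'author.name' => $person->{name} } },
      );
    }
  }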

If we were going to represent Member and Person objects in MongoDB, we would probably create a Person collection and a Member collection. A Member inherits from Person, so the question is whether we should insert data into both the Person collection and the Member collection, or whether a Member object should be inserted only into Member. And if we decide to insert into Person as well, do we insert the entire Member object or just the part that's relevant for Persons?

My intuition tells me it would be best to insert Member objects only into Member, not Person, which means that searching for Persons must look through both the Person and the Member collections.
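
In practice that could look something like this, continuing the sketch from before:

  # Search both collections and merge the results.
  my @hits;
  for my $collection ($database->get_collection('Person'),
                      $database->get_collection('Member')) {
    push @hits, $collection->find({ 'name.last' => 'Nilsen' })->all;
  }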

In conclusion, MongoDB is a more natural data store for O2 than relational database management systems, and, I think, a better one. And it might be possible to make searching really fast in MongoDB if we can tolerate duplication in the database and longer save times.