Saturday 8 October 2011

Perl, MySQL and character sets

If you've never had to work with a combination of these three things, consider yourself lucky, because this can be really frustrating stuff.

When data enters a Perl program, the data should be decoded to Perl's internal format (encoding/character set). When data leaves Perl, the data should be encoded to the encoding that the external system expects. So when I am using the DBI module and its MySQL driver, I expect that those modules will handle the necessary character set conversions since they are closer to the external system (the database) than my code is. The data I send to DBI should be in Perl's internal format, DBI's MySQL driver should check which encoding the database uses to store the data, and the driver should convert the data to that character set. Similarly the other way; the MySQL driver knows which character set the database uses and should decode the data coming from the database to Perl's internal format. If this had been the case, I wouldn't have had to think about character sets in my own code.
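To make that boundary rule concrete, here is a minimal sketch using Perl's core Encode module; the file names and the external encodings are just examples, not anything specific to our system.

    use strict;
    use warnings;
    use Encode qw(decode encode);

    # Data entering the program: decode from the external encoding
    # (here utf-8) into Perl's internal string format.
    open my $in, '<', 'input.txt' or die "Cannot open input.txt: $!";
    my $raw      = <$in>;
    my $internal = decode('UTF-8', $raw);

    # Data leaving the program: encode from the internal format to
    # whatever the receiving system expects (here latin1).
    open my $out, '>', 'output.txt' or die "Cannot open output.txt: $!";
    print {$out} encode('ISO-8859-1', $internal);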

Sadly, though, this is not how things work. So when I tried to fix the encoding problems that arose a couple of years ago, I did some improvising: I encoded data to utf-8 before sending it to the database and decoded it from utf-8 when fetching it back. It has worked without problems, but the data has been stored in the database with the wrong encoding. The database was set up to store data as latin1, and I think the MySQL driver expected to receive latin1-encoded data as well, so utf-8 data ended up in latin1 tables. And if some of the tables had been set up to hold utf-8 data, our already utf-8-encoded data would have been treated as latin1 and encoded to utf-8 once more, leaving it double-encoded. In either case the data was encoded from latin1 to utf-8 one time too many.

We had known for a while that the data in the database had the wrong encoding, but everything worked, so we didn't do anything about it - until this week.

One of the things we had to figure out was how we should encode data going into the database and how to decode it coming back. As I already said, the MySQL driver should ideally handle the encoding and decoding, but this is not an ideal world, obviously. In the documentation for DBI (which is a database API for Perl) we can read the following:
Perl supports two kinds of strings: Unicode (utf-8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding). Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used. Similarly, when fetching from the database character data that isn't iso-8859-1 the driver should convert it into utf8.
To me, this is a little bit hard to understand, but since Perl can use either latin1 (iso-8859-1) or utf-8 to store data internally, I interpret it to mean that we can use Perl's internal format, both when sending data to and fetching it from the database. That would have been good, but it doesn't work. One thing that does work, on the other hand, is to connect to the database with the mysql_enable_utf8 option and then utf-8-encode data when storing it in the database. The data we get back from the database seems to be in Perl's internal format, so the documentation seems to be correct regarding that last part.
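As a rough sketch, the combination that ended up working for us looks something like this; the database name, credentials, table and column below are placeholders, not our actual setup.

    use strict;
    use warnings;
    use DBI;
    use Encode qw(encode);

    # Connect with the mysql_enable_utf8 option turned on.
    my $dbh = DBI->connect(
        'DBI:mysql:database=mydb;host=localhost',
        'user', 'password',
        { RaiseError => 1, mysql_enable_utf8 => 1 },
    );

    # Going in: utf-8-encode the string (which is in Perl's internal
    # format) before storing it.
    my $name = "Bjørn";
    $dbh->do('INSERT INTO person (name) VALUES (?)', undef, encode('UTF-8', $name));

    # Coming out: the data we get back seems to be in Perl's internal
    # format already, so no manual decoding is needed.
    my ($fetched) = $dbh->selectrow_array(
        'SELECT name FROM person WHERE name = ?', undef, encode('UTF-8', $name)
    );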

With these changes in place, we can now store data correctly in the database and get it back out correctly. But we still had to convert the existing data in the database. We first tried the following approach:
  1. Dump the database with mysqldump.
  2. Go through the lines in the dump file and encode each line from utf-8 to latin1.
  3. Import the modified dump (replacing the original database).
This approach worked for most, but not all, characters. We didn't specify which character set we wanted the dump file to be in, so perhaps the content came out as Windows-1252 or something, but we haven't tried to verify that. Before we came up with that theory, we found a possible solution on the web: dump the database as latin1, which means the dump file will actually be utf-8-encoded, since the declared character set of the database was wrong in the first place. Then we could reimport that utf-8 file without modifications, only making sure to tell MySQL that it is utf-8-encoded. (Well, actually, we did make one change: we changed the encoding of the tables from latin1 to utf-8 in the table definitions, since we want to be able to store more characters than just those found in the latin1 character set.)
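For reference, the conversion could be driven from Perl roughly like this; the database name, credentials and file name are placeholders, and the exact mysqldump/mysql options may need adjusting for a given setup.

    use strict;
    use warnings;

    # 1. Dump with --default-character-set=latin1 so that the stored bytes
    #    (which are really utf-8) are written out untouched.
    system('mysqldump', '--default-character-set=latin1',
           '-u', 'user', '-ppassword', 'mydb',
           '--result-file=dump.sql') == 0
        or die "mysqldump failed: $?";

    # 2. At this point we also edited the table definitions in dump.sql,
    #    changing latin1 to utf8.

    # 3. Reimport, telling MySQL that the file is utf-8-encoded.
    system('mysql --default-character-set=utf8 -u user -ppassword mydb < dump.sql') == 0
        or die "mysql import failed: $?";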

Amazingly, it worked, at least on our development server. Unamazingly, it did not work on the test server. It is extremely annoying when things don't behave the same way on all servers. We had three different theories for the difference in behavior.
  • For some reason the data in the database (after conversion) was not in the correct format even though it looked the same on both servers.
  • A difference in the configuration of MySQL, the web server or some other system on the two servers.
  • Perl, MySQL, DBI and DBI's MySQL driver (DBD::mysql) all had different versions on the two servers, so upgrading some of them on the test server could potentially solve the problem.
Long story short, the solution was to upgrade DBD::mysql on the test server. It turns out that the mysql_enable_utf8 option didn't work or hadn't been implemented in older versions of DBD::mysql, and one of those older versions was installed on the test server.

The last problem was resolved around 4pm on Friday, after about four days of working on the whole database character set issue, and we were actually quite close to giving up. So it was a great relief when everything finally worked as it should at the end of the week. Let's just hope there aren't any hidden bugs we haven't found yet!

Saturday 16 April 2011

MongoDB - usable for O2?

Most programmers are familiar with relational database management systems such as MySQL, MSSQL, PostgreSQL or Oracle. In these systems the tables we create expect their data to be in a very specific format, and an object is normally not stored in just one table; rather, its data is usually distributed across several database tables. For example, in O2, a Member object uses four tables for its data: O2_OBJ_OBJECT, O2_OBJ_PERSON, O2_OBJ_MEMBER and O2_OBJ_OBJECT_VARCHAR.

MongoDB is a so-called document-oriented database management system, and it is supposed to be faster than relational database management systems and to scale better. But it doesn't have all the functionality of the relational database management systems; most importantly, it doesn't support joins.

MongoDB doesn't have tables or rows like "normal" databases. Instead it has something called collections and documents, but you can think of a collection as sort of a table, and of a document as a row.

When using Perl, as we do, a document is a hash ref. It has lots of keys and corresponding values. A value doesn't have to be a scalar, it can also be a reference to another hash, or a reference to an array. And MongoDB doesn't care about how that structure looks. You are free to insert data structures with totally different keys into the same collection. Whether this is a smart thing to do is another matter... We would probably store similar data structures in each separate collection ("table"). If we should start using MongoDB in O2, one object would likely correspond to one document in the collection which represents that object's class.
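As an illustration, inserting and fetching a document with the Perl driver looks roughly like this; the database, collection and field names are made up, and the driver API may differ between versions.

    use strict;
    use warnings;
    use MongoDB;

    # Connects to localhost:27017 by default.
    my $connection = MongoDB::Connection->new;
    my $collection = $connection->get_database('o2')->get_collection('member');

    # A document is just a hash ref; values may themselves be hash refs
    # or array refs.
    my $member = {
        name      => 'Ola Nordmann',
        email     => 'ola@example.com',
        interests => [ 'music', 'sports' ],
        address   => { street => 'Storgata 1', city => 'Oslo' },
    };

    $collection->insert($member);

    # Fetching it back gives us a hash ref with the same structure.
    my $found = $collection->find_one({ email => 'ola@example.com' });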

Objects in O2 can be related to each other. The way to do this is to set the type of a field to a class name - the class name of objects that can be stored in that field. In the database it is only the ID of those objects that is stored in the "main" object. This means that when we search for objects through related objects using relational databases, we have to join some tables in the search query, which can have a severe impact on performance, especially when we have to join more than a couple of tables, which sometimes happens in O2. MongoDB has something called DBRef, which is a reference to another object. I first thought it would be possible to search for fields in the related object through the main object and the DBRef, but it turns out that's not possible after all. Anyway, it might not have been that efficient either, if it were possible.

I believe the main reason why O2 is sometimes slow is that searches involving related objects can take a long time, due to several joins, and because MySQL sometimes doesn't execute the joins in the most efficient order. One way to mitigate this with MongoDB could be to store whole serialized objects instead of just object IDs in the database. This, however, has two challenges: 1) duplication of data, and 2) how to update all the serialized copies when the actual object is updated.
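To illustrate the two alternatives, here are two hypothetical document shapes for an article that refers to a person; the field names are only for illustration.

    # 1. Reference by ID: no duplication, but answering questions about
    #    the related person requires a second lookup (or a join in SQL).
    my $article_by_id = {
        _id      => 1001,
        title    => 'Character sets revisited',
        authorId => 42,
    };

    # 2. Embedded serialized copy: searches on author fields hit only one
    #    collection, at the cost of duplicated data that must be kept in sync.
    my $article_embedded = {
        _id    => 1001,
        title  => 'Character sets revisited',
        author => { _id => 42, name => 'Ola Nordmann', email => 'ola@example.com' },
    };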

We have to choose between fast response times and no duplication - we can't have both. At least not without (other forms of) caching. But caching makes things harder to debug, so the less caching we need, I think, the better.

When it comes to updating all the serialized copies, I think this is possible if we store, in the actual object, the IDs of the objects that contain serialized copies of it. Then, when the object is saved, we go through all of these other objects and update them as well. This will make saving objects slower than it is today, but the question is by how much. If it is very slow, it might be possible to do it asynchronously, so the user doesn't have to wait for it.
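Here is a sketch of what that could look like, assuming the person object keeps a list of IDs of the articles that embed a copy of it; the collections, field names and the helper itself are all hypothetical.

    use strict;
    use warnings;

    # Save a person and refresh every article that embeds a copy of it.
    sub save_person {
        my ($person_collection, $article_collection, $person) = @_;

        # Save (or insert) the object itself.
        $person_collection->update({ _id => $person->{_id} }, $person, { upsert => 1 });

        # Embed a copy without the bookkeeping field.
        my %copy = %{$person};
        delete $copy{copyHolderIds};

        # Update every document that holds a serialized copy.
        for my $article_id (@{ $person->{copyHolderIds} || [] }) {
            $article_collection->update(
                { _id => $article_id },
                { '$set' => { author => \%copy } },
            );
        }
    }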

If we were going to represent Member and Person objects in MongoDB, we would probably create a Person collection and a Member collection. A Member inherits from Person, so the question is whether we should insert data into both the Person collection and the Member collection, or whether a Member object should be inserted only into Member. And if we decide to insert into Person as well, do we insert the entire Member object or just the part that's relevant for Persons?

My intuition tells me it would be best to insert Member objects only into Member, not Person, which means that searching for Persons must look through both Person and Member.

In conclusion, MongoDB is a more natural data store for O2 than relational database management systems, and, I think, a better one. And it might be possible to make searching really fast in MongoDB if we can tolerate duplication in the database and longer save times.

Thursday 18 March 2010

Exceptionally cool things you can do with LinkPulse: The Red Box

We already knew that our customers are cool innovators with great ideas. That's why they use LinkPulse. But we were nonetheless struck with awe when we found out that one of them had created this red box click counter with data from LinkPulse.

Per Åstrøm is Technical Manager of New Platforms at Tv4.se, and together with David Hall he made it during a "hack day" at Bonnier (the company that owns Tv4).

Basically, it shows a number known in LP as "State Now", based on traffic over the past 15 minutes, along with an arrow indicating whether traffic is going up or down.

Data is fed by an XML feed from the LinkPulse application.

We love it!!

Tuesday 26 January 2010

Nettavisen increases traffic after changing the frontpage

The Norwegian news outlet and LinkPulse customer Nettavisen yesterday announced a 14% increase in page views and a 40% increase in clickthrough after changing their frontpage layout.

Clickthrough is basically the conversion of page views to clicks; i.e. if you go to the front page, do you click on an article?

We believe that such a dramatic change could only come as a result of a dedicated effort to see what users do and what users don't do on their frontpage.

We are happily convinced that LinkPulse was an important part of the project and send our warmest congratulations to the staff over at Nettavisen.

Cheers!

Thursday 21 January 2010

klikk.no receives a Golden Tag for 2009

Our congratulations go to our customer klikk.no (Hjemmet Mortensen) who received a diploma for best website in the 2009 Golden Tag Awards.

Klikk.no is one of the fastest growing internet magazines in Norway, and we are pleased to know that LinkPulse is used every day to make that happen.

For instance, we are grateful to klikk.no for beta-testing our new browser Toolbar, which will surely change everyone's lives when we release it this spring!

Keep up the good work!

Tuesday 19 January 2010

Remoteless: Remote control for Spotify and iPhone - I tested it.

Disclaimer: My friend is on the development team of Remoteless.

Last night I participated in a very exciting informal user test of an upcoming iPhone app called Remoteless, scheduled to arrive in the App Store around March 1, 2010. It's a remote control for Spotify (on Windows).

First of all, I must say I'm impressed by the job they've done: since Spotify offers no open API to control it, they had to use image processing and interaction simulation to communicate with the Spotify client. It requires you to install a small program on your Windows computer, and for now that's the only platform they support.

Since I am a Mac user I don't reckon I will buy the app, but having tried it I would certainly recommend it to Windows users out there who need an easy way to switch the music playing on their computer without getting off the couch.

Unlike other apps that simply let you control the computer remotely, this app actually communicates with Spotify's public metadata API to search for artists, albums and songs, and when you are ready to play a song, its URI is sent to Spotify along with some simulated double-click events and the like.

It even lets you save tracks, albums and artists as favorites, something I've been missing in Spotify!

Since it's scheduled for release around the same time as Spotify's arrival in the US, I suppose this app will do well, and I think it deserves to!

Wednesday 18 November 2009

In beta: Web analytics tools market share in Nordic news outlets

I was inspired by KAIZEN Analytics' recent post on web analytics tool market shares in the automotive industry to do a similar study of online news outlets in the Nordic countries, specifically Norway, Sweden, Denmark and Finland.

Just like Kaizen, I will use WASP to ask each web site which tools it uses. Moreover, I only care about core web analytics, not ad trackers and the like. I also took the liberty of adding the LinkPulse numbers (as WASP does not recognize it, but I know who they are).

I would like to emphasize at this point that this is a research project in the making, and I'm publishing preliminary results to see if there's interest in these numbers out there. For now, I have far more data for Norway than for the other countries, because it's much easier for me to decide which sites to count and which not to count. Basically, I've tried to include web sites that correspond to daily newspapers, as well as online-only sites centered on the dissemination of news, be they portals or niche sites covering, say, sports or economics. Another criterion I've been considering is the amount of traffic according to official metrics, but I haven't followed this strictly so far.

Further work on this research will include establishing more rigid selection criteria and gathering more data from the other Nordic countries. Other things to consider are groups of sites that buy tools collectively because of common ownership, as well as how to use the fact that most sites have more than one tool.

Another interesting possibility is to factor in each site's official traffic numbers. That way we could see something about which tools account for the most traffic, or the like.

As of now, I have counted the tools on 66 sites, of which a little under half are Norwegian.

One difficulty is accounting for at least two "disturbing" factors in the data. One is that one of the tools, Google Analytics, is free and therefore has a far lower threshold for use than the other tools. The other is that each of the Nordic countries has one tool that is used across the board due to a common agreement to provide official metrics.

I haven't decided how to account for some of these issues, and therefore present the data with a big juicy footnote: consider it largely incomplete. Nonetheless, I think it gives an interesting view of what is generally being used on news sites.


Nordic countries
The first pie shows the usage of web analytics tools in Norway, Sweden, Denmark and Finland, weighted so that it totals 100% even though some sites use more than one tool. These numbers are skewed by the fact that almost half the data comes from Norwegian sites. Another difficulty is that TNS Metrix and Gemius are "forced" tools in some countries, but also used voluntarily in others. This is an issue that I will work to solve.

Norway only
The second pie focuses on the Norwegian data only. Here I have removed both Google Analytics and TNS Metrix: Google Analytics is free and therefore used by virtually everyone, and TNS Metrix is the official tool in Norway and therefore used by literally everyone (at least all the sites I gathered data about). The usage of these may be interesting in and of itself, but it may overshadow interesting facts about the usage of the other tools.

No conclusions

Since this endeavor has just started, and since there are so many issues yet to be resolved, I refrain from drawing any conclusions from the data so far. I just let the pies stand as they are.

What I would like to get comments on is whether these sorts of numbers are interesting for anyone out there to follow, as well as suggestions on how to resolve some of the issues I have raised that may skew the data.

Also, if you know of anyone else doing similar research, I'd love to know about it, especially if it's related to online news media.

Of course, if there's anything else you have on your mind about any of this, feel free to leave a note.