Saturday 16 April 2011

MongoDB - usable for O2?

Most programmers are familiar with relational database management systems such as mysql, mssql, postgresql or oracle. In these systems the tables we create expect its data to be in a very specific format and an object is normally stored not in just one table; rather, its data are usually distributed across several database tables. For example, in O2, a Member object uses four tables for its data: O2_OBJ_OBJECT, O2_OBJ_PERSON, O2_OBJ_MEMBER and O2_OBJ_OBJECT_VARCHAR.

MongoDB is a so-called object oriented database management system, and it is supposed to be faster than relational database management systems and scale better. But it doesn't have all the functionality of the relational database management systems; most importantly it doesn't support joins.

MongoDB doesn't have tables or rows like "normal" databases. Instead it has something called collections and documents, but you can think of a collection as sort of a table, and of a document as a row.

When using Perl, as we do, a document is a hash ref. It has lots of keys and corresponding values. A value doesn't have to be a scalar, it can also be a reference to another hash, or a reference to an array. And MongoDB doesn't care about how that structure looks. You are free to insert data structures with totally different keys into the same collection. Whether this is a smart thing to do is another matter... We would probably store similar data structures in each separate collection ("table"). If we should start using MongoDB in O2, one object would likely correspond to one document in the collection which represents that object's class.

Objects in O2 can be related to each other. The way to do this is to set the type of a field to a class name - the class name of objects that can be stored in that field. In the database it is only the ID of those objects that is stored in the "main" object. This means that when we search for objects through related objects using relational databases, we have to join some tables in the search query, which can have a severe impact on performance, especially when we have to join more than a couple of tables, which sometimes happens in O2. MongoDB has something called DBRef, which is a reference to another object. I first thought it would be possible to search for fields in the related object through the main object and the DBRef, but it turns out that's not possible after all. Anyway, it might not have been that efficient either, if it were possible.

I believe the main reason why O2 is sometimes slow is that searches in related objects can take a long time, due to several joins, and because mysql sometimes doesn't execute the joins in the most efficient order. One way to mitigate this using MongoDB could be to store whole serialized objects instead of just object IDs in the database. This, however, has two challenges: 1) Duplication of data, and 2) How to update all the serialized objects when the actual object is updated.

We have to choose between fast response times and no duplication - we can't have both. At least not without (other forms of) caching. But caching makes things harder to debug, so the less caching we need, I think, the better.

When it comes to updating all serialized objects, I think this is possible if we store in the actual object the IDs of the objects that contain serialized copies of itself. And when the object is saved, we go through all of these other objects and update them, as well. This will make saving objects slower than today, but the question is by how much. If it is very slow, it might be possible to do it asynchronously, so that the user doesn't have to wait for it.

If we were going to represent Member and Person objects in MongoDB, we would probably create a Person collection and a Member collection. A Member inherits from Person, so the question is whether we should insert data into both the Person collection and the Member collection, or whether a Member object should be inserted only into Member. And if we decide to insert into Person as well, do we insert the entire Member object or just the part that's relevant for Persons?

My intuition tells me it would be best to insert Member objects only into Member, not Person. Which means that searching for Persons must search through both Person and Member.

In conclusion, MongoDB is a more natural data storage for O2 than relational database management systems, and, I think, better. And it might be possible to make searching really fast in MongoDB if we can tolerate duplication in the database and longer saving times.

3 comments:

  1. Hei
    Kul at du skaper litt liv på bloggen vår, selv om jeg ikke skjønner mye av det du skriver
    Ha en god påske
    Hans

    ReplyDelete
  2. Takk, god påske til deg også!

    ReplyDelete
  3. I would actually agree with you to a certain point, however I don't think you should worry to much about the speed issue upon saves of the serialized structure in MongoDB. However the search issue is troublesome issue!

    ReplyDelete