Friday, September 7, 2007

TupleSoup, Part 6

I have just uploaded a new release of TupleSoup. Unless you have multiple threads accessing your data, there is no real benefit for you in this release. In fact, you are even required to change a few things in your source... sorry :-( ...however, the possibilities these changes open for are really exciting, which will be what the rest of this post is about.

The primary change since the last version has been that the class Table has been renamed to DualFileTable and Table has instead been recreated as an interface. The idea behind this is that it will be possible for other parts of the system to work with any kind of table storage as long as they adhere to the Table interface. To adapt your code, all you need to do is change your instantiations to use DualFileTable, instead of Table as shown in the following sample.

Table table=new DualFileTable("test","./");

The first new type of table is also included in this release, its called HashedTable and is basically a hash of 16 separate DualFileTable objects. Whenever you insert a new row, it uses the row id to decided which of the 16 tables it should be stored in. In an ideal situation, this allows up to 16 threads to write new rows to disk at the same time without locking (remember, even in an ideal situation most of these threads would still have to fight for io on a lower level). However, the penalty is that any thread who wants to scan through the entire set of data in the table, will now have to scan through 16 sets of data instead of one. Even though these will be smaller than the one large, this will bring in a substantial overhead, so only use this new table for data storage that rarely do full table scans.

This might not sound too exciting if you are not really doing massively parallel applications, however this whole refactoring has opened up great ways for me to add future functionality. One of the primary features I have been trying to find a good solution for is value indexing. That is, to create an index on a table that indexes all values of a single key on all rows. This will make it really snappy to select a set of rows based on the values of that key in the rows. With this new design, this type of indexing can be designed as an object that implements the Table interface and simply wraps around another table, injecting the indexing functionality where it is needed.

Thats about it for this post, I hope to soon follow up with a post that provides a longer full sample of TupleSoup usage.

No comments: