Wednesday, January 24, 2007

TupleSoup, Part 4

I started the TupleSoup implementation without using an index to keep things simple so I could get a working code base as soon as possible. Then I added the index for faster random access operations. It turned out that full fetches of the whole dataset using the index is not that much slower than scanning the files without it especially with an index cache. Relying on the index for all operations removed the need for the delete file and thats just about where the implementation was left by the time I wrote part 2 of this series.

However, there is really no need for an update file either. A single data file and an index would serve my needs quite well.... except, having two files brings us a new set of possibilities especially if we start viewing them as equals functionality wise. So from now on TupleSoup will store three files for each table: file-a, file-b and an index.

The most obvious gain from this, is that it will now possible able to add and update rows in parallel, locking only on the index. This will in theory allow us to perform better when used in a multithreaded environment. But the far more exciting result from this change, is that we can now lock one file for changes without affecting the functionality of the system. If we lock file-a for changes, all adds and updates simply go to file-b. This will allow us to do maintenance work on the data files while the system is running. One such task could be to compact the data files to reclaim space lost when rows are deleted or updated. Another could be to rearrange rows to match an external data index, improving performance when reading rows based on that index.

We will have a closer look at the design and implementation of those maintenance tasks in a later post discussing data integrity and recovery, but for now, lets end this post with a look at the performance of TupleSoup with the new file-a and b. These tests are run with the paged index, 50.000 rows and an index cache of 10.000 entries. Furthermore, I finally got a new battery from Apple after my old one swelled up and died, so I am now running at full speed again making these numbers incomparable to the previous tests.

Test typeOriginalDual file
Add0.281 ms0.279 ms
Fetch5.590 ms6.317 ms
Update5.494 ms6.286 ms

There is a slight, but acceptable, penalty to bulk fetches and updates from a single thread, but the adds are about the same. I have not yet created a usable multithreaded benchmark, so the possible performance gain is still unknown ;-)

No comments: