Friday, January 19, 2007

TupleSoup, Part 2

The goal of TupleSoup is to deliver a data engine that can store and retrieve schemaless rows with a variable-length string id. For my own use, it is important that the data is contained in as few files as possible, so I have chosen a basic layout of four files (actually only three, but let's start out by assuming four): add, update, delete and index. There is not really much fun in the first three data files. If you add a row, append it to the add file. If you update a row, append it to the update file. And finally, if you delete a row, append its id to the delete file. In theory, we don't need the index to retrieve the data. To fetch a specific row, just follow these simple rules until one matches (a code sketch follows the list).

  1. Check the delete file. If the row's id is in it, the row has been deleted.
  2. Scan through the update file. If the row is found there, the last copy in that file is the correct version.
  3. Scan through the add file until you find it.
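
A rough sketch of these rules in Java; note that Row, readIds() and readRows() are hypothetical stand-ins for TupleSoup's actual row class and file readers, which aren't shown in this post:

    // Sketch only: Row, readIds() and readRows() are hypothetical
    // stand-ins for TupleSoup's actual row class and file readers.
    public Row fetch(String id) throws IOException {
        // Rule 1: an id present in the delete file means the row is gone.
        if (readIds(deleteFile).contains(id)) return null;
        // Rule 2: the last matching copy in the update file wins.
        Row latest = null;
        for (Row row : readRows(updateFile)) {
            if (row.getId().equals(id)) latest = row;
        }
        if (latest != null) return latest;
        // Rule 3: otherwise scan the add file until the row turns up.
        for (Row row : readRows(addFile)) {
            if (row.getId().equals(id)) return row;
        }
        return null; // no such row
    }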

This is all very simple to implement, and it will work well for small datasets. However, to locate a row we may need to scan all the way through three files, two of which could grow quite big along with the dataset. This is where the index comes in. The index stores an entry for each row id that says in which file and at which offset the row can be found. The task of locating a row is thus reduced to looking up the id in the index and going directly to the right file at the right offset to read it.

Now that we have an index, we can also optimize the update system to waste less disk space without too big a penalty. By using the index, we can go straight to the current location of the row and write the new row in its space if it will fit. We don't even need to read the old row first to check its size if we store the row's size in the index. Relying fully on an index will also make the delete file redundant, saving even more space and getting closer to my goal of a very simple file layout.
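
To make that concrete, here is one possible shape for an index entry; the field names and file constants below are my own guesses for illustration, not TupleSoup's actual layout:

    // One possible index entry layout; names are guesses, not
    // TupleSoup's actual code.
    public class IndexEntry {
        public static final int FILE_ADD = 0;    // row lives in the add file
        public static final int FILE_UPDATE = 1; // row lives in the update file

        public final String id;   // variable-length row id
        public final int file;    // which data file holds the row
        public final long offset; // byte offset of the row within that file
        public final int size;    // serialized size, enabling in-place updates

        public IndexEntry(String id, int file, long offset, int size) {
            this.id = id;
            this.file = file;
            this.offset = offset;
            this.size = size;
        }
    }

With the size stored, an update can compare the new row's serialized size against the entry and overwrite in place when it fits, falling back to an append (plus an index update) when it doesn't.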

Unfortunately, the index isn't free. When we add or modify a row, we can no longer just append it to a file; we must also maintain the index. There are many different ways to implement an index, each of which will perform well in some areas and poorly in others. For the initial version of TupleSoup, I have chosen to implement three different index types: a memory based index with a journal, a flat file index and a paged file index. The following subsections describe each of these indexes and provide rudimentary benchmark results. The benchmarks measure the average time to add, fetch and update a row in 25.000 and 50.000 row datasets. Be aware that these tests were run on my MacBook Pro 17" Core Duo 2.16 GHz with a 5400 rpm disk. It is currently running at reduced speed due to a dead battery, so don't take these times as an actual performance indication.

Memory Index

The memory based index keeps an entry for each row in a hashtable, making lookups insanely fast. Whenever changes to the index are made, they are appended to the index file. This functions as a journal that allows the index to recreate itself in memory after a crash. Let's look at the test results.

Test type    25.000 rows    50.000 rows
Add          0.068 ms       0.069 ms
Fetch        0.137 ms       0.136 ms
Update       1.512 ms       1.970 ms

It is quite obvious from these results that the memory based index suffers very little from a growing dataset. However, to use this index type, we must be able to keep the full index in memory at all times. The current implementation uses 52 bytes plus four times the number of characters in your ids per index entry. As an example, if we have a dataset with 50.000 rows, each with an average id length of 20 characters, then the index will take up 50.000 * (52+20*4) = 6.600.000 bytes = 6.445 kb. That may not seem like very much, but you need to be 100% sure that your dataset has no way of growing beyond the memory available to TupleSoup. If that should happen, you might not be able to start TupleSoup at all to access your data any more.
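
For illustration, here is a condensed sketch of the journaled hashtable idea, reusing the hypothetical IndexEntry from above and assuming hypothetical writeEntry()/readEntry() codecs for the journal records:

    import java.io.*;
    import java.util.*;

    // Sketch of a journaled in-memory index: a hashtable for lookups
    // plus an append-only journal file for durability. writeEntry()
    // and readEntry() are hypothetical entry codecs.
    public class MemoryIndex {
        private final Map<String, IndexEntry> entries = new HashMap<String, IndexEntry>();
        private final DataOutputStream journal;

        public MemoryIndex(File journalFile) throws IOException {
            // Recreate the in-memory table by replaying the journal.
            if (journalFile.exists()) {
                DataInputStream in = new DataInputStream(
                        new BufferedInputStream(new FileInputStream(journalFile)));
                try {
                    while (true) {
                        IndexEntry entry = readEntry(in); // hypothetical decoder
                        entries.put(entry.id, entry);     // later entries win
                    }
                } catch (EOFException endOfJournal) {
                    // normal end of replay
                } finally {
                    in.close();
                }
            }
            journal = new DataOutputStream(new BufferedOutputStream(
                    new FileOutputStream(journalFile, true))); // append mode
        }

        public IndexEntry lookup(String id) {
            return entries.get(id); // pure hashtable lookup, hence the speed
        }

        public void put(IndexEntry entry) throws IOException {
            entries.put(entry.id, entry);
            writeEntry(journal, entry); // hypothetical encoder; append only
            journal.flush();
        }
    }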

Flat Index

The second index type is the flat index. It uses a single index file much in the same way the memory based index does, but it does not cache anything. Instead, it scans through its index file whenever a request is made. Furthermore, it does not simply append changed entries the way the memory index does; instead, it updates the existing entries in its file in place when they are modified.
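
As a sketch (again using the hypothetical readEntry() decoder from above), a flat index lookup is nothing more than a linear scan from the start of the file:

    // Sketch of a flat index lookup: scan fixed-format entries from
    // the start of the index file until the id matches.
    public IndexEntry lookup(String id, File indexFile) throws IOException {
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(indexFile)));
        try {
            while (true) {
                IndexEntry entry = readEntry(in); // hypothetical decoder
                if (entry.id.equals(id)) return entry;
            }
        } catch (EOFException endOfFile) {
            return null; // scanned the whole file without a match
        } finally {
            in.close();
        }
    }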

Test type    25.000 rows    50.000 rows
Add          0.237 ms       0.257 ms
Fetch        11.816 ms      23.997 ms
Update       30.014 ms      60.636 ms

These numbers are pretty much as expected. Adds are very quick and are not really affected by the size of the dataset, since we are still just appending to a single file. Fetches are almost 100 times slower than with the memory based index and become proportionally slower as the dataset grows. Updates are only around 20 times slower than with the memory based index, because a big part of the task isn't index related, but they too suffer immensely from a growing dataset.

Paged Index

The problem with the flat index is that we have to scan through the full index to find the entries at the end of it. A simple way to fix this is to divide the index file into pages of entries, along with a method that can quickly help us find the right page for a given row id.

The paged index does exactly this. It uses a single file divided into pages of entries. Each page has two pointers: one to the next page and one to a page holding entries whose id hashcodes are lower than that of the page's first entry. To scan for a row id's entry in this index, follow these simple rules, starting at the first page (a code sketch follows the list).

  1. If the hashcode of the id you are looking for is lower than that of the first entry in this page, start again with the page marked as lower for this page.
  2. If it is equal to or higher, scan through this page. If the entry is not found here, start again with the page marked as next.
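
A sketch of that loop in Java; Page, readPage() and the offset fields are hypothetical, with -1 standing in for a missing pointer:

    // Sketch of the page-walking rules above. Page is hypothetical:
    // it can report its first entry's hashcode, scan itself for an id,
    // and point to its "lower" and "next" pages.
    public IndexEntry lookup(String id) throws IOException {
        int hash = id.hashCode();
        long pageOffset = firstPageOffset;
        while (pageOffset != -1) {
            Page page = readPage(pageOffset); // hypothetical page reader
            if (hash < page.firstEntryHash()) {
                // Rule 1: the id sorts below this page, follow the lower pointer.
                pageOffset = page.lowerPageOffset();
            } else {
                // Rule 2: scan this page; if the entry is missing, follow next.
                IndexEntry entry = page.scanFor(id);
                if (entry != null) return entry;
                pageOffset = page.nextPageOffset();
            }
        }
        return null; // not in the index
    }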

This index gives us a simple tree structure which can help us locate the needed record faster. However, to add a new entry, it is no longer enough to just append it; you must scan through the tree structure to locate the first valid page with enough free space for the new entry. Also, there is no guarantee that this tree will be balanced. In fact, if you keep adding entries with ever lower hashcodes, each one will end up creating a new page of its own, giving you a huge, slow index. However, if entries are added in a somewhat random order with regard to hashcodes, chances are that it won't be that bad. Let's have a look at how the paged index compares to the flat index performance-wise.

Test type    25.000 rows    50.000 rows
Add          0.582 ms       0.582 ms
Fetch        13.196 ms      18.430 ms
Update       34.015 ms      47.355 ms

As expected, adds are slower than with the flat index. They do, however, seem stable even as the index grows. Fetches are initially slower than with the flat index for a small dataset, but are less affected by a growing dataset, beating the flat index in the 50.000 tuple test. The same goes for updates.

B-Tree Index

The primary problem with the paged index is the risk of generating a highly unbalanced tree. To improve upon this, we would have to use a self-balancing structure such as a b-tree. Under normal circumstances, this wouldn't be that hard to implement, but one of the key design goals for TupleSoup is to have variable-length ids. This means that each node in the tree could take up a unique number of bytes on disk, making the task of re-balancing the tree much less fun.

Unfortunately, I'm feeling lazy and the current indexes will quite happily serve my needs for now, so the b-tree will have to wait for another day :-)

1 comment:

Anonymous said...

Kasper, this is a great article! It highlights your vision on your project, but it also explains very well the general design of a database.
Good Job! Olivier - Switzerland