Monday, January 29, 2007

TupleSoup, Part 5

I'm currently sitting in Heathrow airport, waiting for my connecting flight back to Copenhagen. I still have a little over 12 hours to go. If you have been through Heathrow recently, you probably have a good feel for the pain I'm suffering right now. Finding a decent seat with a power outlet nearby is almost impossible. Finding a proper meal is even harder... If only Starbucks had power outlets near their seats, I would at least be able to enjoy a nice cup of coffee and some good snacks (even though I would also go broke within a couple of hours). So, now that I'm sitting here with plenty of time and nothing important to do, why not continue my series on TupleSoup.

I have now implemented simple data data engine which allows you to store, update and retrieve data. I do have a simple RowMatcher which can be used to select a subset of rows in a table based on the values of these data (think where clause in SQL) but I have no other functionality to work with a set of data instead of single rows.

If we look at TupleSoup from a database users point of view, the next logical step will be to add functionality for working with rows using sort, join, aggregate and selection filters. The aggregation and selection filters are pretty straight forward to implement, but the sort and join functionality will take a little more work. Not because they are inherently more complex features, but because one of the nice features in TupleSoup is that we can fetch result sets that are much larger than available memory. This is accomplished by returning an iterator which reads rows as they are requested. Implementing an aggregation or selection filter that continuos to treat the data as a stream is straight forward, but implementing sort and join functionality without holding the entire dataset in memory is far harder.

The rest of this post will be spend discussing how to sort large datasets without the need to hold the whole set in memory. The basic idea is that we split the task up into smaller tasks and keep the partial results in disk based buffers. So lets first start out looking at a buffer for row data.

Row Buffer

The purpose of the row buffer is to store a set of rows and deliver them again later as an iterator in the same style as if we were accessing a table directly. The simplest implementation would simply hold the rows using a Java List and return its build in iterator when the data is needed again, but since we are going to be using this buffer specifically to avoid holding the full data set in memory, that won't work. Instead, we will use a single flat file to store the data in. Since we are only required to deliver the rows in sequence, we have no need for any kind of index functionality like the regular table. The following table shows the performance of the row buffer compared to the regular table.

Test typeTableRowBuffer
Add row0.296 ms0.023 ms
Fetch row0.177 ms0.012 ms

It looks like the row buffer is around ten times faster than the regular table. Naturally, thats not fast enough for a greedy bunch of software developers like us, so I have added a cache to it which uses a regular Java list to store all rows until a given cache size has been used up. In that way, you can set the maximum acceptable memory usage for your query and the row buffers will ensure to keep things as fast as possible within that limit.

Sorting

If you have read even a little bit about algorithms you have probably come across several sort algorithms such as bubble sort, quick sort, heap sort and radix sort. If you haven't read anything about sort algorithms, you should at least go check out bubble and heap sort.

Under normal circumstances, you would choose your sort algorithm based on its big O complexity and memory usage, but as mentioned in the introduction of this post, we can't rely on the data set being able to fit into memory in TupleSoup. Our goal is thus to choose a sort algorithm that will allow us to sort a huge dataset with a minimum of memory requirements. One algorithm that works particularly well for this is merge sort. The idea behind merge sort is that its quite easy to merge two lists that are already sorted by themselves. In the simplest implementation, you recursively split your dataset into lists until you end up with a set of lists that hold just one element. Then you start merging the lists back together until you end up with one sorted list.

Instead of dividing your initial dataset all the way down to one element lists, you can divide it into a set of lists which are each just small enough to be handled in memory, and then use another sort algorithm to sort these in memory before merging them together. This is exactly how the merge sort in TupleSoup has been implemented. You give it a maximum memory usage which it will split between cache usage for its row buffers and for in memory sorting. Then it uses the sort functionality available in Java's collection to sort small chunks in memory at a time, which it then writes out to row buffers. Once the whole dataset has been presorted in this way, it uses merge sort to merge the resulting row buffers into one final sorted rowbuffer.

The following table shows the performance of the merge sort algorithm in TupleSoup. I have added the sort functionality in Java's collection framework, which also uses merge sort, but keeps everything in memory. Each test sorts 100000 elements which in total takes up a bit more than 17 mb of space when held in memory.

Test typeTime
Collections.sort328 ms
MergeSort 4 mb10260 ms
MergeSort 2 mb11729 ms
MergeSort 1 mb15334 ms
MergeSort 512 kb17848 ms

There is an obvious benefit from having a larger cache size, but even with a good 4 mb of cache we are still much slower than the pure memory sort. One possible improvement to better utilize the cache size, is to only keep the actual values being sorted along with the id for each row. Sorting that reduced dataset and then joining it back up with the full dataset will allow us to sort far bigger chunks in memory. We will return to this in the next post which will cover join operations :-)

No comments: