Monday, January 29, 2007

TupleSoup, Part 5

I'm currently sitting in Heathrow airport, waiting for my connecting flight back to Copenhagen. I still have a little over 12 hours to go. If you have been through Heathrow recently, you probably have a good feel for the pain I'm suffering right now. Finding a decent seat with a power outlet nearby is almost impossible. Finding a proper meal is even harder... If only Starbucks had power outlets near their seats, I would at least be able to enjoy a nice cup of coffee and some good snacks (even though I would also go broke within a couple of hours). So, now that I'm sitting here with plenty of time and nothing important to do, why not continue my series on TupleSoup.

I have now implemented a simple data engine which allows you to store, update and retrieve rows. I do have a simple RowMatcher which can be used to select a subset of rows in a table based on their values (think where clause in SQL), but I have no other functionality for working with a set of rows instead of single rows.

If we look at TupleSoup from a database user's point of view, the next logical step will be to add functionality for working with rows using sort, join, aggregate and selection filters. The aggregation and selection filters are pretty straightforward to implement, but the sort and join functionality will take a little more work. Not because they are inherently more complex features, but because one of the nice features in TupleSoup is that we can fetch result sets that are much larger than available memory. This is accomplished by returning an iterator which reads rows as they are requested. Implementing an aggregation or selection filter that continues to treat the data as a stream is straightforward, but implementing sort and join functionality without holding the entire dataset in memory is far harder.

The rest of this post will be spent discussing how to sort large datasets without the need to hold the whole set in memory. The basic idea is that we split the task up into smaller tasks and keep the partial results in disk based buffers. So let's start by looking at a buffer for row data.

Row Buffer

The purpose of the row buffer is to store a set of rows and deliver them again later as an iterator, in the same style as if we were accessing a table directly. The simplest implementation would hold the rows in a Java List and return its built-in iterator when the data is needed again, but since we are going to be using this buffer specifically to avoid holding the full data set in memory, that won't work. Instead, we will use a single flat file to store the data. Since we are only required to deliver the rows in sequence, we have no need for any kind of index functionality like the regular table has. The following table shows the performance of the row buffer compared to the regular table.

Test type   Table      RowBuffer
Add row     0.296 ms   0.023 ms
Fetch row   0.177 ms   0.012 ms

It looks like the row buffer is around ten times faster than the regular table. Naturally, that's not fast enough for a greedy bunch of software developers like us, so I have added a cache to it which uses a regular Java list to store all rows until a given cache size has been used up. In that way, you can set the maximum acceptable memory usage for your query, and the row buffers will keep things as fast as possible within that limit.
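The buffer-with-cache idea can be sketched roughly like this, with a row reduced to a single serialized string for clarity. All class and method names here are illustrative, not TupleSoup's actual API:

```java
import java.io.*;
import java.util.*;

// Sketch of a row buffer with a memory cache: rows are kept in an in-memory
// list until a byte budget is spent, after which they spill to a flat file.
// Iteration replays the cached rows first, then the spilled rows, which
// preserves insertion order since spilling only starts once the cache is full.
class RowBufferSketch {
    private final List<String> cache = new ArrayList<>();
    private long cacheBytes = 0;
    private final long maxCacheBytes;
    private final File spill;
    private PrintWriter out;

    RowBufferSketch(long maxCacheBytes) throws IOException {
        this.maxCacheBytes = maxCacheBytes;
        this.spill = File.createTempFile("rowbuffer", ".tmp");
        this.spill.deleteOnExit();
    }

    // Store one serialized row, spilling to disk once the cache budget is used up.
    // String length is used as a crude stand-in for the row's real memory size.
    void addRow(String serializedRow) throws IOException {
        if (cacheBytes + serializedRow.length() <= maxCacheBytes) {
            cache.add(serializedRow);
            cacheBytes += serializedRow.length();
        } else {
            if (out == null) out = new PrintWriter(new FileWriter(spill));
            out.println(serializedRow);
        }
    }

    // Return all rows in insertion order: cached rows first, then spilled rows.
    Iterator<String> rows() throws IOException {
        if (out != null) out.flush();
        List<String> result = new ArrayList<>(cache);
        if (spill.length() > 0) {
            try (BufferedReader in = new BufferedReader(new FileReader(spill))) {
                String line;
                while ((line = in.readLine()) != null) result.add(line);
            }
        }
        return result.iterator();
    }
}
```

A real implementation would of course hand back a streaming iterator over the spill file instead of materializing a list, but the cache-then-spill structure is the point here.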


Merge Sort

If you have read even a little bit about algorithms, you have probably come across several sort algorithms such as bubble sort, quick sort, heap sort and radix sort. If you haven't read anything about sort algorithms, you should at least go check out bubble and heap sort.

Under normal circumstances, you would choose your sort algorithm based on its big O complexity and memory usage, but as mentioned in the introduction of this post, we can't rely on the data set fitting into memory in TupleSoup. Our goal is thus to choose a sort algorithm that will allow us to sort a huge dataset with minimal memory requirements. One algorithm that works particularly well for this is merge sort. The idea behind merge sort is that it's quite easy to merge two lists that are each already sorted. In the simplest implementation, you recursively split your dataset into lists until you end up with a set of lists that hold just one element. Then you start merging the lists back together until you end up with one sorted list.

Instead of dividing your initial dataset all the way down to one element lists, you can divide it into a set of lists which are each just small enough to be handled in memory, and then use another sort algorithm to sort these in memory before merging them together. This is exactly how the merge sort in TupleSoup has been implemented. You give it a maximum memory usage, which it will split between cache usage for its row buffers and for in memory sorting. Then it uses the sort functionality in Java's collections framework to sort small chunks in memory, one at a time, which it then writes out to row buffers. Once the whole dataset has been presorted in this way, it uses merge sort to merge the resulting row buffers into one final sorted row buffer.
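The two-phase approach can be illustrated in miniature, with plain lists standing in for row buffers (so this sketch keeps everything in memory; the real implementation writes each presorted chunk to a disk based row buffer):

```java
import java.util.*;

// Sketch of the two-phase sort: presort fixed-size chunks with the collections
// framework, then merge them with a priority queue that holds one head element
// per chunk, so the merge only ever looks at one element from each chunk.
class ExternalSortSketch {
    static List<Integer> sort(List<Integer> data, int chunkSize) {
        // Phase 1: presort chunks of at most chunkSize elements ("row buffers").
        List<List<Integer>> chunks = new ArrayList<>();
        for (int i = 0; i < data.size(); i += chunkSize) {
            List<Integer> chunk = new ArrayList<>(
                data.subList(i, Math.min(i + chunkSize, data.size())));
            Collections.sort(chunk);
            chunks.add(chunk);
        }
        // Phase 2: k-way merge, pulling the smallest head element each round.
        PriorityQueue<int[]> heads =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        int[] pos = new int[chunks.size()];
        for (int c = 0; c < chunks.size(); c++)
            if (!chunks.get(c).isEmpty())
                heads.add(new int[]{chunks.get(c).get(0), c});
        List<Integer> result = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();     // smallest remaining element
            result.add(head[0]);
            int c = head[1];
            if (++pos[c] < chunks.get(c).size())
                heads.add(new int[]{chunks.get(c).get(pos[c]), c});
        }
        return result;
    }
}
```

With row buffers in place of the chunk lists, only one element per buffer needs to be in memory during the merge, which is what lets the total dataset exceed available memory.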

The following table shows the performance of the merge sort algorithm in TupleSoup. For comparison, I have included the sort functionality from Java's collections framework, which also uses merge sort, but keeps everything in memory. Each test sorts 100.000 elements which in total take up a bit more than 17 mb of space when held in memory.

Test type          Time
Collections.sort   328 ms
MergeSort 4 mb     10260 ms
MergeSort 2 mb     11729 ms
MergeSort 1 mb     15334 ms
MergeSort 512 kb   17848 ms

There is an obvious benefit from having a larger cache size, but even with a good 4 mb of cache we are still much slower than the pure memory sort. One possible improvement to better utilize the cache is to keep only the actual values being sorted, along with the id for each row. Sorting that reduced dataset and then joining it back up with the full dataset will allow us to sort far bigger chunks in memory. We will return to this in the next post, which will cover join operations :-)

Wednesday, January 24, 2007

TupleSoup, Part 4

I started the TupleSoup implementation without an index to keep things simple, so I could get a working code base as soon as possible. Then I added the index for faster random access operations. It turned out that full fetches of the whole dataset using the index are not that much slower than scanning the files without it, especially with an index cache. Relying on the index for all operations removed the need for the delete file, and that's just about where the implementation was left by the time I wrote part 2 of this series.

However, there is really no need for an update file either. A single data file and an index would serve my needs quite well... except that having two files brings us a new set of possibilities, especially if we start viewing them as functional equals. So from now on, TupleSoup will store three files for each table: file-a, file-b and an index.

The most obvious gain from this is that it will now be possible to add and update rows in parallel, locking only on the index. This should in theory allow us to perform better in a multithreaded environment. But the far more exciting result of this change is that we can now lock one file for changes without affecting the functionality of the system. If we lock file-a for changes, all adds and updates simply go to file-b. This will allow us to do maintenance work on the data files while the system is running. One such task could be to compact the data files to reclaim space lost when rows are deleted or updated. Another could be to rearrange rows to match an external data index, improving performance when reading rows based on that index.
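The write-target rule at the heart of this scheme is tiny and can be sketched like so (all names are illustrative, not TupleSoup's actual API):

```java
// Sketch of the dual data file write rule: while one file is locked for
// maintenance (compaction, reordering), all adds and updates go to the
// other file, so writers never have to wait for maintenance to finish.
class DualFileSketch {
    private String locked = null; // "file-a", "file-b" or null

    // Lock one file for maintenance; only one file may be locked at a time.
    void lockForMaintenance(String file) { locked = file; }
    void unlock() { locked = null; }

    // New rows and updates always go to a file that is not under maintenance.
    String writeTarget() {
        return "file-a".equals(locked) ? "file-b" : "file-a";
    }
}
```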

We will have a closer look at the design and implementation of those maintenance tasks in a later post discussing data integrity and recovery, but for now, let's end this post with a look at the performance of TupleSoup with the new file-a and file-b. These tests are run with the paged index, 50.000 rows and an index cache of 10.000 entries. Furthermore, I finally got a new battery from Apple after my old one swelled up and died, so I am now running at full speed again, making these numbers incomparable to the previous tests.

Test type   Original   Dual file
Add         0.281 ms   0.279 ms
Fetch       5.590 ms   6.317 ms
Update      5.494 ms   6.286 ms

There is a slight, but acceptable, penalty to bulk fetches and updates from a single thread, but the adds are about the same. I have not yet created a usable multithreaded benchmark, so the possible performance gain is still unknown ;-)

Sunday, January 21, 2007

TupleSoup, Part 3

The clear lesson from my last post on TupleSoup was that the index is the primary performance bottleneck in my current implementation. This was especially evident in the update operation that needs to access the index twice if the updated row won't fit in the same space as the original. The next step for improving TupleSoup performance will thus be the implementation of an index cache. The goal is to get close to the same high performance we got from the memory based index for often used rows.

The actual caching of index entries is very easy to implement. Simply create a hashtable and use it to cache your entries using the row id as the key. The slightly trickier part is how to maintain the cache so it doesn't grow above a fixed maximum size. The optimal implementation (sometimes referred to as a clairvoyant replacement algorithm) would simply keep exactly those entries in the cache that we will need next. Unfortunately, to even come close to accurately predicting this, we would need a very thorough understanding of the system that is going to be using TupleSoup. That doesn't really fit well with the goal of TupleSoup being a generic data engine, so instead we must look for a generic algorithm which maintains the cache based on usage.

A good source of theory on cache algorithms is the literature on page replacement algorithms. These are the algorithms operating systems use to choose which memory pages to keep in memory and which to swap out when new pages are read from disk. Some of these algorithms use knowledge about how applications use memory. That is naturally not of use to us, but the ideas and concepts are very useful for designing our own caching algorithm. Unfortunately, I am going to disappoint you now, because all I will implement to begin with is a very simple least recently used algorithm. I do have some fun ideas for more advanced algorithms, but they are closely linked to value based indexes and query caching. Since we currently have neither a value based indexing mechanism nor anything even resembling a query system, this will have to wait.

But for now, let's look at the least recently used algorithm. The idea is very simple. We use a fixed-size list of index entries. Whenever we use an entry, we remove it from its current position in the list and move it to the front. When we need space to add a new entry, we remove the entry currently at the back and add the new one at the front. The idea is that entries which are used often will be "saved" and put to the front again, while entries rarely used will drop to the back and be removed.
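In Java, this exact move-to-front-and-evict-from-the-back behavior can be sketched on top of LinkedHashMap, which supports access ordering and eviction of the eldest entry out of the box (the Long value here stands in for a real index entry holding file, offset and size):

```java
import java.util.*;

// Sketch of a least recently used index cache. With accessOrder=true,
// LinkedHashMap moves an entry to the front of its internal list on every
// get, and removeEldestEntry lets us drop the least recently used entry
// whenever the fixed capacity is exceeded.
class LruIndexCache extends LinkedHashMap<String, Long> {
    private final int capacity;

    LruIndexCache(int capacity) {
        super(16, 0.75f, true); // true = iterate in access order, i.e. LRU order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
        return size() > capacity; // evict the least recently used entry
    }
}
```

For example, with a capacity of two, putting "a" and "b", touching "a", and then putting "c" evicts "b", since "b" is the entry that has gone unused the longest.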

If the cache is big enough to keep recurring entries in the list until they are used again, this can be very effective. Unfortunately, the larger the cache, the bigger the overhead of maintaining it. Let's have a quick look at the same benchmark we used when we looked at indexing algorithms, but this time with a least recently used cache algorithm in place. We will use the big 50.000 row dataset with a paged index and a cache size of 10.000 index entries. This won't give us any real benchmark of how the cache algorithm will perform in actual usage, but it will reveal the overhead involved when adding new entries and give a slight hint of the effect in the fetch and update tests.

Test type   Original    With cache
Add         0.582 ms    0.621 ms
Fetch       18.430 ms   16.989 ms
Update      47.355 ms   23.382 ms

The add is slightly slower, but even in this very unnatural benchmark we actually end up getting a performance improvement in both fetch and update. Let's increase the size of the cache to 50.000 entries so it can hold the full dataset. This should give us an idea of the best performance we can expect from it if the usage of the dataset falls perfectly within the bounds of the least recently used algorithm.

Test type   Original    Full cache
Add         0.582 ms    0.588 ms
Fetch       18.430 ms   1.197 ms
Update      47.355 ms   14.557 ms

We are still far away from the performance of the memory based index, even in the fetch test, due to the overhead of the caching algorithm, but we are definitely approaching a good usable performance.

Friday, January 19, 2007

TupleSoup, Part 2

The goal of TupleSoup is to deliver a data engine that is able to store and retrieve schemaless rows with a variable length string id. For my own use, it is important that the data is contained in as few files as possible. I have thus chosen a basic layout of four files (actually only three, but let's start out by assuming four): add, update, delete and index. There is not really much fun in the first three data files. If you add a row, append it to the add file. If you update a row, append it to the update file. And finally, if you delete a row, append its id to the delete file. In theory, we don't need the index to retrieve the data. To fetch a specific row, just follow these simple rules until one matches.

  1. Check the delete file. If the row's id is in it, the row has been deleted.
  2. Scan through the update file. If the row is found there, the last copy in that file will be the correct version.
  3. Scan through the add file until you find it.
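The three rules can be sketched like this, with each file reduced to an in-memory structure for clarity (a real implementation would scan the actual files):

```java
import java.util.*;

// Sketch of the index-less lookup rules: delete file first, then the update
// file (where the LAST matching entry is the current version), then the add
// file. Each "file" is modeled as a collection of (id, row) pairs; a deleted
// or unknown id yields null.
class IndexlessFetchSketch {
    static String fetch(String id, Set<String> deleteFile,
                        List<String[]> updateFile, List<String[]> addFile) {
        // Rule 1: an id in the delete file means the row is gone.
        if (deleteFile.contains(id)) return null;
        // Rule 2: scan the whole update file; the last match wins, since every
        // update of the row was appended after the previous one.
        String found = null;
        for (String[] entry : updateFile)
            if (entry[0].equals(id)) found = entry[1];
        if (found != null) return found;
        // Rule 3: fall back to scanning the add file.
        for (String[] entry : addFile)
            if (entry[0].equals(id)) return entry[1];
        return null;
    }
}
```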

This is all very, very simple to implement and it will work well for small datasets. However, to locate a row we may need to scan all the way through three files, two of which could grow quite big along with the dataset. This is where the index comes into use. The index stores an entry for each row id which says in which file and at which offset the row can be found. The task of locating a row has thus been reduced to looking up the id in the index and going directly to the right file at the right offset to read it.

Now that we have an index, we can also optimize the update system to waste less disk space without too big a penalty. By using the index, we can go straight to the current location of the row and write the new row in its space if it will fit. We don't even need to check the size of the old row first if we add the size of the row to the index. Relying fully on the index also makes the delete file redundant, saving even more space and getting closer to my goal of a very simple file layout.
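An index entry along those lines needs to carry at least three things: the file, the offset and the reserved size. A sketch, with illustrative field names rather than TupleSoup's actual ones:

```java
// Sketch of what an index entry must hold once we rely fully on the index:
// which data file contains the row, at what byte offset it starts, and how
// many bytes are reserved there, so an update can be written in place
// whenever the new version of the row fits in the old space.
class IndexEntrySketch {
    final String file;  // e.g. "add" or "update" (later: "file-a" / "file-b")
    final long offset;  // byte offset of the row within that file
    final int size;     // bytes reserved for the row at that offset

    IndexEntrySketch(String file, long offset, int size) {
        this.file = file;
        this.offset = offset;
        this.size = size;
    }

    // An update can reuse the old location only when the new row fits there.
    boolean fitsInPlace(int newRowSize) {
        return newRowSize <= size;
    }
}
```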

Unfortunately, the index isn't free. When we add or modify a row, we can no longer just append it to a file, we must also maintain the index. There are many different ways to implement an index, each of which will perform well in some areas and poorly in others. For the initial version of TupleSoup, I have chosen to implement three different index types: a memory based index with a journal, a flat file index and a paged file index. The following sub sections will describe each of these indexes and provide rudimentary benchmark results. The benchmark measures the average time to add, fetch and update a row in a 25.000 and a 50.000 row dataset. Be aware that these tests are run on my MacBook Pro 17" Core Duo 2.16GHz with a 5400rpm disk. It's currently running at a lower speed due to a dead battery, so don't take these times as an actual performance indication.

Memory Index

The memory based index keeps an entry for each row in a hashtable, making lookups insanely fast. Whenever changes to the index are made, they are appended to the index file. This functions as a journal that allows the index to recreate itself in memory after a crash. Let's look at the test results.
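A rough sketch of this design, with an in-memory list standing in for the journal file and a plain offset standing in for a full index entry:

```java
import java.util.*;

// Sketch of the journaled memory index: lookups hit a hashtable, every change
// is appended to a journal, and after a crash the table can be rebuilt by
// replaying the journal from the start, letting later entries overwrite
// earlier ones.
class MemoryIndexSketch {
    private final Map<String, Long> entries = new HashMap<>();
    private final List<String[]> journal = new ArrayList<>();

    void put(String id, long offset) {
        entries.put(id, offset);
        journal.add(new String[]{id, Long.toString(offset)}); // append-only log
    }

    Long get(String id) {
        return entries.get(id);
    }

    // Crash recovery: replay the journal in order; the last change to each
    // id wins, reproducing the state of the hashtable at the time of the crash.
    static MemoryIndexSketch recover(List<String[]> journal) {
        MemoryIndexSketch index = new MemoryIndexSketch();
        for (String[] change : journal)
            index.put(change[0], Long.parseLong(change[1]));
        return index;
    }

    List<String[]> journal() {
        return journal;
    }
}
```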

Test type   25.000     50.000
Add         0.068 ms   0.069 ms
Fetch       0.137 ms   0.136 ms
Update      1.512 ms   1.970 ms

It is quite obvious from these results that the memory based index suffers a very low penalty from a growing dataset. However, to use this index type we must be able to keep the full index in memory at all times. The current implementation of the index will use 52 bytes plus four times the number of characters in your ids per index entry. As an example, if we have a dataset with 50.000 rows, each with an average id length of 20 characters, then the index will take up 50.000 * (52+20*4) = 6.600.000 bytes = 6.445 kb. That may not seem like very much, but you need to be 100% sure that your dataset has no way of growing beyond the memory available to TupleSoup. If that should happen, you might not be able to start TupleSoup up to access your data any more.

Flat index

The second index type is the flat index. It uses a single index file much in the same way the memory based index does. However, it does not cache anything. Instead, it scans through its index file whenever a request is made. Furthermore, it does not simply append changes to rows as the memory index does. Instead, it updates the existing entries in its file when they are modified.

Test type   25.000      50.000
Add         0.237 ms    0.257 ms
Fetch       11.816 ms   23.997 ms
Update      30.014 ms   60.636 ms

These numbers are pretty much as expected. Adds are very quick and are not really affected by the size of the dataset, since we are still just appending to a single file. Fetches are almost 100 times slower than with the memory based index and become proportionally slower as the dataset grows. Updates are only around 20 times slower than with the memory based index, due to a big part of the task not being index related, but they also suffer immensely from a growing dataset.

Paged Index

The problem with the flat index is that we have to scan through the full index to find the entries at the end of it. A simple approach to fix this is to divide the index file into pages of entries along with a method that can quickly help us find the right page for a given row id.

The paged index does exactly this. It uses a single file divided into pages of entries. Each page has two pointers: one to the next page, and one to a page holding entries whose id hashcodes are lower than the first entry in the page itself. To scan for an entry for a row id in this index, follow these simple rules, starting at the first page.

  1. If the hashcode for the id you are looking for is lower than the first entry in this page, start again with the page marked as lower for this page.
  2. If it's equal or higher, scan through this page. If the entry is not found here, start again with the page marked as next.

This index gives us a simple tree structure which can help us locate the needed record faster. However, to add a new entry, it is no longer enough to just append it; you must scan through the tree structure to locate the first valid page which has enough free space for the new entry. Also, there is no guarantee that this tree will be balanced. In fact, if you keep adding entries with a lower hashcode than the last, each will end up creating a new page of its own, giving you a huge, slow index. However, if entries are added in a somewhat random order with regards to hashcodes, chances are that it won't be that bad. Let's have a look at how the paged index compares to the flat index performance-wise.
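The two lookup rules above can be sketched with pages as objects instead of file regions (class and field names are illustrative):

```java
import java.util.*;

// Sketch of the paged index lookup. Each page holds a list of ids, a "lower"
// pointer to a page whose id hashcodes are below this page's first entry,
// and a "next" pointer to the following page at the same level.
class PagedIndexSketch {
    static class Page {
        List<String> ids = new ArrayList<>();
        Page lower, next;
    }

    // Follow the two rules from the post, starting at the given first page.
    static boolean contains(Page page, String id) {
        while (page != null) {
            if (!page.ids.isEmpty()
                    && id.hashCode() < page.ids.get(0).hashCode()) {
                page = page.lower;   // rule 1: branch to the lower page
            } else {
                if (page.ids.contains(id)) return true;
                page = page.next;    // rule 2: scan this page, then follow next
            }
        }
        return false;                // ran out of pages: no such entry
    }
}
```

In the real index, a hit would of course return the entry's file and offset rather than a boolean, but the page-walking logic is the same.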

Test type   25.000      50.000
Add         0.582 ms    0.582 ms
Fetch       13.196 ms   18.430 ms
Update      34.015 ms   47.355 ms

As expected, adds are slower than with the flat index. They do however seem somewhat stable even when the index grows in size. The fetch is initially slower than the flat index for a small dataset, but is less affected by a growing dataset, beating the flat index in the test with 50.000 tuples. The same goes for the update test.

B-Tree Index

The primary problem with the paged index is the risk of generating a highly unbalanced tree. To improve upon this, we would have to use a self balancing structure such as a b-tree. Under normal circumstances, this wouldn't be that hard to implement, but one of the key design goals for TupleSoup is to have variable length ids. This means that each node in the tree could take up a different number of bytes on disk, making the task of re-balancing the tree much less fun.

Unfortunately, I'm feeling lazy and the current indexes will quite happily serve my needs for now, so the b-tree will have to wait for another day :-)

Saturday, January 13, 2007

TupleSoup, Part 1

I am in desperate need of a small embeddable database for a project at work. It would be far too easy to just pick one among the many excellent open source Java databases already available (H2, Derby, tinySQL, Mckoi, QED etc.), so naturally I am writing my own. As I work on this project, I intend to regularly post about my design and implementation choices on this blog. The complete code will be posted on SourceForge as soon as it is usable to anyone but me in any way.

If the design and implementation of a small database sounds appealing to you, why not follow me here while I do my very best to document the process in the most entertaining way possible... ok, that might be overstating it a bit - but, if you haven't written one yourself yet, you might want to read along to learn a bit from my mistakes and if you have already written one or two, you might want to stick around to get a good laugh out of those same mistakes. So let me start out by introducing my vision for what this little database should be able to do.

Last year I read Andrew Hitchcock's excellent post about the talk Jeff Dean gave at the University of Washington on Google Bigtable. It was later followed up with an official article (if you have any interest in alternative databases at all, I highly recommend that you read the article, or at least Andrew Hitchcock's blog post). That article really got me thinking about the possibilities of using a simple schemaless database with variable length strings as ids for records in my daily work. I had previously really enjoyed implementing a simple schemaless database, which I have long been thinking about rewriting. So that's where I am now. TupleSoup will be the core data engine in a new database I am going to start writing (I hope I will also be able to open source that database, but I am currently discussing that with my employer). It will be schemaless, and rows will be stored and fetched based on variable length string ids.

I could go on ranting for hours about how I would like to implement this data engine and how I would like to be able to use it, but instead I'm going to stop blabbering now and end this post with a few lines of code demonstrating how I imagine that the finished product would be used.

Table tbl=new Table("datafile",".");

Row row=new Row("kapper");
row.put("name","Kasper J. Jeppesen");

// Store the row and fetch it back by its id (these method names are still imaginary)
tbl.addRow(row);
Row fetched=tbl.fetchRow("kapper");

Tuesday, January 9, 2007

Why I still love Java

I do most of my large scale software development in Java and all of my scripting in Ruby. It's a wonderful combination that keeps me happy and productive.

Mentioning Ruby is greeted by most other developers with smiling nods of approval - it's young, trendy and easy to love - however, mentioning that I use Java as my main language gets very mixed reactions. Everything from pointing and laughing to a direct punch to the face is possible. Ok, maybe not a direct punch, but at least a girlish slap or a really mean insult.

Unfortunately, explaining what I love about Java can't be done in a sentence or two as a quick response to "Your language sucks!". It takes vast amounts of time and will often require me to peel away layer upon layer of prejudice. Since I am by nature a very lazy person, I have decided to write this post so I can simply direct people here in the future. I'm sure that the number of Java hating people who will actually read this post will be very, very close to zero, but that's an insignificant fact I can easily trick myself into forgetting.

Let's first get the most common type of prejudice I encounter as a Java developer out of the way. No, Java is not the only programming language I know, and yes, I have developed plenty of larger projects in languages without garbage collection - even in C. In fact, let me take this opportunity to bore you with the long and dull tale of my background as a software developer.

I took my first steps as a software developer in the mid eighties with Logo on my dad's Apple II. The fact that I could make that little turtle on the screen do anything I wanted was so incredibly alluring. Yes, I know it was just a little triangle on a low resolution monochrome screen, but I was not even ten at the time and had never seen a real turtle. To me the experience was as close to having my own little reptilian slave as I could imagine... see, this is the kind of power trip functional programming will bring forth in young minds!

A couple of years later I moved with my mom and sister to Saudi Arabia. We lived near a hospital where my mom worked. Several of her friends among the local doctors helped out teaching me and my sister English, math and physics. Two of them were very interested in computers and soon introduced me to BASIC on a ZX Spectrum and a lunchbox sized Sharp PC7000. This opened up a whole new world for me. Not only was BASIC a fun language for an 11 year old boy to learn, but I now had two enthusiastic, intelligent programmers who would gladly spend their time sharing their knowledge. This was so addictive that I had to get my own computer programmable in BASIC. I ended up getting a small Casio fx-785P calculator. Not only was it programmable in BASIC, but it also had a rudimentary assembler and came with a small booklet introducing me to the concepts of assembly programming.

After we returned to Denmark in the late eighties, I finally got my own full sized computer - an Amstrad CPC 6128 which let me move on to Pascal as my next language. The structure and encapsulation that Pascal brought to my programs allowed me to progress in the size and scope of my projects.

In 1991 I went to Canada for a year as an exchange student. Here I excitedly enrolled in the computer science class at the local high school and was greatly disappointed when I completed the year's worth of material in about two months without really learning anything new. Luckily, my teacher in this class was so excited to actually have a student with a passion for programming that he gladly spent the rest of the year introducing me to C and C++.

Upon returning to Denmark, I sold my Amstrad and bought an Amiga 500, which would once again bring me back to assembly programming, this time on the far more elegant Motorola 68000 CPU. The early nineties were an amazing time to be a programmer on the Amiga in Europe. I was no longer alone as an enthusiastic amateur programmer, but surrounded by other young people competing in demo competitions all over Europe.

In 1995 I started studying for my masters degree in computer science at a Danish university (I never actually completed my masters... I think I am still lacking about half a semester's worth of non-CS related course work). I never really felt that the actual courses at the university gave me that much, but the vast amounts of information and topics they introduced me to moved me further as a programmer than anything else had in the previous ten years. I started learning new languages as a hobby and soon developed a serious crush on the simplicity of the Forth core, the object oriented beauty of Smalltalk and the strangely different Prolog. Unfortunately, none of these easily lovable languages were able to keep up with all of my needs. So I stayed faithful to C++.

By the end of 1995 I discovered Java. It was slow as hell, but strangely appealing. I started out using it purely for applets, which brought back a lot of the excitement of the demo scene on the Amiga. I even created a small applet (yeah, I know, that thing looks ridiculous today, especially since I didn't time sync it) for the Java intro competition at The Party in 1997. I was having more fun programming than I had had in a long time. I was still trying out new languages continuously, but I never really found anything that was as much fun as Java at the time.

As time passed, the speed of Java and the size of its bundled API grew. If Sun hadn't kept focus on improving the Java platform, I'm sure I would have left it many years ago, but somehow it kept feeling like the development of the Java platform was keeping pace with my needs as a programmer. So it was only natural for me to keep using it as my primary language. That's basically the story of how I ended up with Java as my primary language, which brings us to the actual point of this post: what it is that still makes me love Java today. There are four primary features that currently keep Java as my primary language.

A modern garbage collector
I have no problems writing well functioning code in languages without garbage collection, or with manually assisted garbage collection such as pre-Leopard Cocoa. However, with the fast, well functioning garbage collector in Java, I feel like I am wasting my time whenever I handle these things manually in other languages.

A full featured consistent API
One of the things that finally made me drop PHP as a web development platform was the consistently inconsistent API. In Java the standard API is consistent enough that I can often guess how it works and how methods will be named.

Good performance
I don't necessarily need to be working in the fastest language/API combo possible, but performance is a factor for most of the software I develop so it needs to be close enough to the top, that I can remain competitive.

A good security model
There seems to be a direct correlation between the size of my programmer-ego and the number of mistakes I assume I won't make. By choosing a language with a good security model I can spend less time worrying about the quality of my code and more time bragging about my coding skills.

Java satisfies these four demands quite well. But, honestly any language/API combo that would satisfy these demands just as well for my intended target platform would work for me. I have no religious attachments to Java, but the fact that it has evolved hand in hand with my needs over the past 11 years and kept my life as a programmer fun has kept it as my primary language.

Monday, January 8, 2007

Syntactic Sirup

It's quite obvious that the world lacks yet another variation of the term syntactic sugar. Let me thus introduce a new one with the naming of this blog.

Syntactic Sirup
An addition to the syntax of a language which at first seems wonderful and sweet, but will soon leave your code in a sticky mess that is more easily rewritten than cleaned up.