Monday, February 26, 2007

Minimizing Id Size While Maximizing Readability

The last few posts have primarily been me bitching about things that annoy me, so I thought I would throw in a little happy-go-lucky constructive post with almost no bitching at all.

I mentioned in my last post about pretty urls that I always attempt to create my ids so they avoid using characters that are easily confused, such as I, 1, l, O and 0. Doing so is actually very easy and in this little post I will show you how I do it in Ravenous and even provide you with a ready to use implementation in Java.

The basic idea is that we convert integer ids into a string representation using only the letters and numbers that are not easily confused. For my implementation, I chose to use the following 31 characters: 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, G, H, J, K, M, N, P, Q, R, S, T, U, V, W, X, Y, Z.

A positive side benefit is that we will also be moving from a base 10 system to base 31, which means that in many cases we will seriously reduce the number of characters used for an id.

So let's start talking about the implementation by looking at the way we would convert a number into a string. First, let's build an array of chars containing our 31-character set.

char[] digits={'2', '3', '4', '5', '6',
               '7', '8', '9', 'A', 'B',
               'C', 'D', 'E', 'F', 'G',
               'H', 'J', 'K', 'M', 'N',
               'P', 'Q', 'R', 'S', 'T',
               'U', 'V', 'W', 'X', 'Y',
               'Z'};

We use this array to directly look up the character that a number between 0 and 30 should be converted into. To convert a full number into a string, we simply keep dividing it by 31, each time appending the character representing the remainder of the division to the string. When the remaining number is 0, we are done. (Note that this writes the least significant digit first, which is fine as long as the reverse conversion reads the string the same way.)

String convertToString(long num){
    StringBuilder buf=new StringBuilder();
    while(num>0){
        // the remainder picks the next character (least significant digit first)
        buf.append(digits[(int)(num % 31)]);
        num=num/31;
    }
    return buf.toString();
}

Easy, right? Converting a string back into a number is just as easy. This time, simply run through each character, convert the single character back into a number based on its position in our translation array and multiply that by 31 raised to the power of the current character position. Sound confusing? Try imagining that we were working with a base 10 system. The value of the first character would be multiplied by 10 ^ 0 = 1, the next by 10 ^ 1 = 10, the next by 10 ^ 2 = 100 and so forth, thus multiplying each character by the value of its position. This is the exact same thing we are doing in our base 31 system. So let's look at the code. I'm assuming that we have a method called charToInt that will convert a base 31 character into its int value from our table.

long convertToLong(String str){
    str=str.toUpperCase().trim();
    long num=0;
    long pow=1;    // 31 raised to the power of the current character position
    for(int i=0;i<str.length();i++){
        char c=str.charAt(i);
        num+=charToInt(c)*pow;
        pow*=31;
    }
    return num;
}

Easy, right? So how about that charToInt method? Well, there are several strategies you can use to implement it, but honestly the simplest is just a switch statement on the char that returns the right value for each case.

int charToInt(char c){
    switch(c){
        case '2':return 0;
        case '3':return 1;
        ...
    }
    return 0;
}

I have omitted the last 29 cases, but you should be getting the point by now :-). The last return is purely there to satisfy the Java compiler and to provide a semi-rational behavior when (not if) someone attempts to feed it an invalid id string. If you want to optimize this further, you could create an array just like the one used for converting from numbers to characters. All you need to do is find the largest numeric value of those 31 characters, create an array of that size and fill in the right values.
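
Just to sketch that optimization (this isn't the Ravenous code, merely an illustration that reuses the digits array from above):

int[] values=new int[91];    // 'Z' is 90, the largest character value we need to cover

void buildLookupTable(){
    for(int i=0;i<digits.length;i++){
        values[digits[i]]=i;
    }
}

int charToInt(char c){
    if(c>='2'&&c<='Z'){
        return values[c];
    }
    // anything outside the table falls back to 0, just like the switch version
    return 0;
}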

So what do these ids look like and how much space will we save? Let's end this article with a set of sample ids, the strings they are converted into and a rounded-down percentage showing how much space we save.

Integer      String    Saved
1            3         0 %
12           E         50 %
123          Z5        33 %
1234         UA3       25 %
12345        9VE       40 %
123456       GG66      33 %
1234567      SPFC3     28 %
12345678     QQEDF     37 %
123456789    425QB6    33 %

Friday, February 23, 2007

The Rise of the URL Fascist Regime

Some time within the past couple of years, something really strange happened. Most of my web developing friends turned into a bunch of very vocal url fascists.

Urls are suddenly divided into two sharply separated groups: the urls that fit their views and the urls that do not. If your urls do not fit into the group supported by the fascist leaders, it doesn't matter what the reasons are for your url design or how great your system is. Your urls suck, thus your system sucks and you as a developer are at best hopelessly misguided.

So what is this all about?

The people I'm referring to are not graphic designers who do a bit of web development. The leaders of this regime are all programmers who feel right at home in emacs and vi. The natural assumption would thus be that this is one of the regular crusades for following standards, but that's not it at all... this is not about making urls more technically functional. This is about making the urls look pretty. That's actually what the whole topic is called - pretty urls. Not readable urls, not memorizable urls, not semantic urls, not understandable urls, not hackable urls, not short urls... but pretty urls.

I decided to ask around a bit, to see if I could better understand what exactly divides the ugly from the pretty. It turns out that there are varying opinions about what the ultimate pretty url looks like, but everybody seems to agree on what an ugly url looks like. One of my friends gave me the following example.

http://foo.bar/cmssystem.php?page=foo&object=bar&object_id=123123&PHPSESSIONID=123123123123123&showOnlyFemales=YESPLEASE&username=FUNKY&password=MYASS&submit.x=33&submit.y=453

Ok, even I can follow that this is not pretty. I would also say that it's not readable, memorizable or understandable either. However, I do think it has some good functional sides, which I will return to later.

So, I tried to create some different examples of urls which I passed around to get a prettiness rating and I finally seemed to figure out the primary difference between a pretty and an ugly URL. Take a look at the following two URLs.

http://foo.bar/profile?id=12345
http://foo.bar/profile/12345/

The ruling from the fascist regime is that the first url is ugly and the second one is pretty. My first instinct was to conclude that this was a question of keeping the urls as short as possible, but it turns out that that's not it at all. The following url combines the two by adding the name of the parameter as a part of the url path.

http://foo.bar/profile/id/12345/

Almost everybody agreed that this was the prettiest of them all. Since this is also the longest of them all, that rules out the "shorter is better" theory. As can easily be deduced from the fact that the fascist leaders didn't mind the actual word id being part of the path, it's not a question of exposing the names of the variables to the users. It's all a question of keeping the ?, = and & characters out of the URL. Apparently these are really ugly characters.

Here are a few more examples of pretty urls given to me by people from the new regime. Notice how they are not designed to be short. Instead they do everything possible to keep all parameters in the path.

http://foo.bar/2007/feb/10/danielwinsthelottery/
http://foo.bar/types/hatchbacks/ford/fiesta/

Notice another thing here. None of these refer to data using an id. They use nearly readable names instead. It should be mentioned, though, that that's not a fixed rule within the pretty url regime; some people still use ids.

So, to sum it up: pretty urls must do everything possible to avoid regular query parameters, and if it makes sense, they should be humanly readable.

Unfortunately, none of the believers in the new pretty url regime can actually explain what makes query parameters so evil. Most people I asked could not come up with anything better than "it's ugly". In fact, most of these people can't even give a good, precise definition of a pretty url, let alone an explanation of why it's good. The following three statements were the only direct statements I could get out of any of them (it should be noted that two of these statements were given to me in Danish, so these are my translations).

  • Each element in the path should be a further refinement into the dataset.
  • The overall path should be humanly readable and understandable.
  • Query parameters should not be used for navigation.

It's not that I completely disagree with these ideas, but it seems that everybody has gotten so focused on the new and wonderful world made possible by brand new technology such as mod_rewrite (been there since 1996) that everything must now be done in this way or it's just way too web 1.0 to be acceptable.

So why am I making a big fuss about this and calling half of my friends fascists? First of all, I'm calling them fascists because they consider the system to be more important than the individuals. That is, pretty urls are more important to them than the actual websites. Secondly, I'm making a big fuss because query parameters actually do have some good things going for them that shouldn't be thrown out just because you think a question mark or an equal sign is ugly. As long as you use proper parameter names, they let the user know what the individual "magic" values actually are.

I'm not saying that you shouldn't put anything in the path. I am definitely all for the move from using the path to specify code to using it to specify data. But religiously staying away from query parameters because they are ugly is just plain stupid. One of the uses for query parameters that I really like is view manipulation - for example, holding the offset and sort order of a table of data. Let's look at an example.

http://foo.bar/users?offset=300&sortby=username

You could put these parameters into the path. But by keeping them as query parameters, it is clearly visible to the user what they are and how they change as she works with the table. For more advanced users, this will also make it easier to manipulate the url directly.

So let me end this looong rant with the guidelines I use when designing urls for web applications.

  1. The path should be readable and specify data, not code.
  2. Things that modify the view of the data should be placed in query parameters.
  3. Avoid using characters in id's that are easily confused such as 0, O, I, 1, l.

Thursday, February 22, 2007

Regular Expressions and String Literals

As previously stated on this blog, I really love Java. I am also very much in love with regular expressions. Unfortunately, these two girls just don't get along. Sure, they will say they do, but whenever I try to get them to cooperate the fighting begins.

If you don't like Java, you will probably use this as an opportunity to say "told you so!". If you like Java but haven't worked much with regular expressions in Java, you will probably point me to the big and wonderful API for working with regular expressions and tell me to go read the documentation... surely, this API must be as nice as the rest of the Java APIs. If you both like Java and have worked seriously with regular expressions in Java, you will probably be nodding right now and saying "Ah yes... string literals".

See, the problem with Java and regular expressions isn't the functionality of its regular expression library. It's the fact that you don't have literals in the language for entering regular expressions, which makes it a pain instead of a pleasure.

Let me show you an example. String objects in Java have a nice method to do replacements based on regular expressions. It takes two Strings as parameters. First the regular expression to search for, then another string to replace with. The replacement string can include $ characters to reference groups caught by the regular expression. So the following call would replace all a's with o's.

String str="Foo Bar";
str=str.replaceAll("a","o");

Simple, right? So what happens when we want to replace a \ with \\ instead of an a with an o? Let's look at the regular expression part. First we need to escape the \ into \\ to satisfy the regular expression engine, which sees \ as a special character. Then we need to do the same for the Java string literal that we are using to enter it, and we must naturally do that for both characters, so we end up with "\\\\" to match a single \.... and how about the replacement part? Well, the same thing applies there, except that we end up with twice the number of characters, giving us the following two parameters for the replaceAll method.

str=str.replaceAll("\\\\","\\\\\\\\\");

Take a close look at that. Can you actually tell that there are exactly eight characters without using any assistance such as a finger or your mouse cursor to count them? If so, you need to reevaluate your own counting abilities, because there were actually nine characters - the last one was just thrown in to show how hard it actually is to count that many identical characters quickly. Let's take a look at the correct version for a second.

str=str.replaceAll("\\\\","\\\\\\\\");

It still looks silly, doesn't it? Remember that this is just a very simple expression trying to replace a single character with two of the same kind. For this very simple example we could actually have grouped the match and referenced it as follows.

str=str.replaceAll("(\\\\)","$1$1");

However, once you start doing any serious regular expression work that includes characters that need escaping for either the regular expression, the string literal or both, things get ugly very quickly. The sad thing is that it really doesn't have to be this hard; there is no reason why we couldn't have regular expression literals in Java... perhaps even with a customizable escape character if we were really lucky.
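
To be fair, the standard library does offer a little help here: Pattern.quote escapes an arbitrary literal so it can be dropped into a regular expression, and Matcher.quoteReplacement does the same for the replacement string. Here is the backslash example again using those helpers (a sketch, not a full fix - you still have to count backslashes in the plain string literals, just not twice):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackslashDemo {
    public static void main(String[] args){
        String str="c:\\temp\\foo";
        // Pattern.quote handles the regular expression escaping, Matcher.quoteReplacement
        // handles the replacement escaping, so each layer is only escaped once by hand.
        str=str.replaceAll(Pattern.quote("\\"), Matcher.quoteReplacement("\\\\"));
        System.out.println(str);    // prints c:\\temp\\foo
    }
}

It is still a far cry from real regular expression literals, but it removes one of the two escaping layers.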

Monday, January 29, 2007

TupleSoup, Part 5

I'm currently sitting in Heathrow airport, waiting for my connecting flight back to Copenhagen. I still have a little over 12 hours to go. If you have been through Heathrow recently, you probably have a good feel for the pain I'm suffering right now. Finding a decent seat with a power outlet nearby is almost impossible. Finding a proper meal is even harder... If only Starbucks had power outlets near their seats, I would at least be able to enjoy a nice cup of coffee and some good snacks (even though I would also go broke within a couple of hours). So, now that I'm sitting here with plenty of time and nothing important to do, why not continue my series on TupleSoup.

I have now implemented a simple data engine which allows you to store, update and retrieve data. I do have a simple RowMatcher which can be used to select a subset of rows in a table based on the values of their data (think WHERE clause in SQL), but I have no other functionality for working with a set of data instead of single rows.

If we look at TupleSoup from a database user's point of view, the next logical step will be to add functionality for working with rows using sort, join, aggregate and selection filters. The aggregation and selection filters are pretty straightforward to implement, but the sort and join functionality will take a little more work. Not because they are inherently more complex features, but because one of the nice features in TupleSoup is that we can fetch result sets that are much larger than available memory. This is accomplished by returning an iterator which reads rows as they are requested. Implementing an aggregation or selection filter that continues to treat the data as a stream is straightforward, but implementing sort and join functionality without holding the entire dataset in memory is far harder.

The rest of this post will be spent discussing how to sort large datasets without the need to hold the whole set in memory. The basic idea is that we split the task up into smaller tasks and keep the partial results in disk-based buffers. So let's start out by looking at a buffer for row data.

Row Buffer

The purpose of the row buffer is to store a set of rows and deliver them again later as an iterator in the same style as if we were accessing a table directly. The simplest implementation would simply hold the rows in a Java List and return its built-in iterator when the data is needed again, but since we are going to be using this buffer specifically to avoid holding the full data set in memory, that won't work. Instead, we will use a single flat file to store the data in. Since we are only required to deliver the rows in sequence, we have no need for any kind of index functionality like the regular table has. The following table shows the performance of the row buffer compared to the regular table.

Test type    Table       RowBuffer
Add row      0.296 ms    0.023 ms
Fetch row    0.177 ms    0.012 ms

It looks like the row buffer is around ten times faster than the regular table. Naturally, that's not fast enough for a greedy bunch of software developers like us, so I have added a cache to it which uses a regular Java list to store all rows until a given cache size has been used up. In that way, you can set the maximum acceptable memory usage for your query and the row buffers will keep things as fast as possible within that limit.
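
To make that concrete, here is a rough sketch of the idea (not TupleSoup's actual row buffer - it stores plain strings, measures its cache budget crudely in characters, and its iterator can only be asked for once):

import java.io.*;
import java.util.*;

class SimpleRowBuffer {
    private List<String> cache=new ArrayList<String>();   // in-memory part of the buffer
    private long cacheLeft;                                // rough memory budget left (bytes)
    private File file;                                     // flat file used once the cache is full
    private DataOutputStream out;

    SimpleRowBuffer(File file, long cacheSize){
        this.file=file;
        this.cacheLeft=cacheSize;
    }

    void addRow(String row) throws IOException {
        if(cacheLeft>=row.length()*2){
            cache.add(row);                                // still room in the memory cache
            cacheLeft-=row.length()*2;
        }else{
            if(out==null) out=new DataOutputStream(new BufferedOutputStream(new FileOutputStream(file)));
            out.writeUTF(row);                             // spill to the flat file, no index needed
        }
    }

    // One-shot iterator: memory rows first, then whatever was spilled to disk.
    Iterator<String> rows() throws IOException {
        if(out!=null) out.close();
        final Iterator<String> mem=cache.iterator();
        final DataInputStream in=out==null?null:new DataInputStream(new BufferedInputStream(new FileInputStream(file)));
        return new Iterator<String>(){
            public boolean hasNext(){
                try{ return mem.hasNext()||(in!=null&&in.available()>0); }
                catch(IOException e){ return false; }
            }
            public String next(){
                try{ return mem.hasNext()?mem.next():in.readUTF(); }
                catch(IOException e){ throw new NoSuchElementException(); }
            }
            public void remove(){ throw new UnsupportedOperationException(); }
        };
    }
}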

Sorting

If you have read even a little bit about algorithms you have probably come across several sort algorithms such as bubble sort, quick sort, heap sort and radix sort. If you haven't read anything about sort algorithms, you should at least go check out bubble and heap sort.

Under normal circumstances, you would choose your sort algorithm based on its big O complexity and memory usage, but as mentioned in the introduction of this post, we can't rely on the data set being able to fit into memory in TupleSoup. Our goal is thus to choose a sort algorithm that will allow us to sort a huge dataset with a minimum of memory requirements. One algorithm that works particularly well for this is merge sort. The idea behind merge sort is that it's quite easy to merge two lists that are each already sorted. In the simplest implementation, you recursively split your dataset into lists until you end up with a set of lists that hold just one element. Then you start merging the lists back together until you end up with one sorted list.

Instead of dividing your initial dataset all the way down to one-element lists, you can divide it into a set of lists which are each just small enough to be handled in memory, and then use another sort algorithm to sort these in memory before merging them together. This is exactly how the merge sort in TupleSoup has been implemented. You give it a maximum memory usage, which it will split between cache usage for its row buffers and in-memory sorting. Then it uses the sort functionality available in Java's collections framework to sort small chunks in memory at a time, which it then writes out to row buffers. Once the whole dataset has been presorted in this way, it uses merge sort to merge the resulting row buffers into one final sorted row buffer.
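
A stripped-down illustration of that approach (this is not TupleSoup's code - it sorts plain strings and spills to temporary files instead of row buffers, but the structure is the same: presort chunks that fit in memory, then merge):

import java.io.*;
import java.util.*;

class ExternalSort {
    // One entry per chunk: the smallest line not yet merged and the reader it came from.
    static class Head {
        String line; BufferedReader reader;
        Head(String line, BufferedReader reader){ this.line=line; this.reader=reader; }
    }

    // Presort one chunk in memory and spill it to a temporary file.
    static File spillChunk(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        File f=File.createTempFile("chunk",".tmp");
        PrintWriter out=new PrintWriter(new FileWriter(f));
        for(String s:chunk) out.println(s);
        out.close();
        return f;
    }

    // Merge the presorted chunks into one sorted output.
    static void merge(List<File> chunks, Writer result) throws IOException {
        PriorityQueue<Head> queue=new PriorityQueue<Head>(chunks.size(),
            new Comparator<Head>(){
                public int compare(Head a, Head b){ return a.line.compareTo(b.line); }
            });
        for(File f:chunks){
            BufferedReader r=new BufferedReader(new FileReader(f));
            String line=r.readLine();
            if(line!=null) queue.add(new Head(line,r)); else r.close();
        }
        while(!queue.isEmpty()){
            Head head=queue.poll();
            result.write(head.line+"\n");
            head.line=head.reader.readLine();                // pull the next line from the same chunk
            if(head.line!=null) queue.add(head); else head.reader.close();
        }
    }
}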

The following table shows the performance of the merge sort algorithm in TupleSoup. I have included the sort functionality from Java's collections framework for comparison; it also uses merge sort, but keeps everything in memory. Each test sorts 100.000 elements, which in total take up a bit more than 17 mb of space when held in memory.

Test type           Time
Collections.sort    328 ms
MergeSort 4 mb      10260 ms
MergeSort 2 mb      11729 ms
MergeSort 1 mb      15334 ms
MergeSort 512 kb    17848 ms

There is an obvious benefit from having a larger cache size, but even with a good 4 mb of cache we are still much slower than the pure in-memory sort. One possible improvement to make better use of the cache is to keep only the values being sorted along with the id of each row. Sorting that reduced dataset and then joining it back up with the full dataset will allow us to sort far bigger chunks in memory. We will return to this in the next post, which will cover join operations :-)

Wednesday, January 24, 2007

TupleSoup, Part 4

I started the TupleSoup implementation without using an index to keep things simple so I could get a working code base as soon as possible. Then I added the index for faster random access operations. It turned out that full fetches of the whole dataset using the index are not that much slower than scanning the files without it, especially with an index cache. Relying on the index for all operations removed the need for the delete file, and that's just about where the implementation was left by the time I wrote part 2 of this series.

However, there is really no need for an update file either. A single data file and an index would serve my needs quite well... except that having two data files brings a whole new set of possibilities, especially if we start viewing them as functional equals. So from now on TupleSoup will store three files for each table: file-a, file-b and an index.

The most obvious gain from this is that it is now possible to add and update rows in parallel, locking only on the index. This will in theory allow us to perform better in a multithreaded environment. But the far more exciting result of this change is that we can now lock one file for changes without affecting the functionality of the system. If we lock file-a for changes, all adds and updates simply go to file-b. This will allow us to do maintenance work on the data files while the system is running. One such task could be to compact the data files to reclaim space lost when rows are deleted or updated. Another could be to rearrange rows to match an external data index, improving performance when reading rows based on that index.
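
A sketch of the routing idea (with made-up names, not TupleSoup's real classes) might look like this - writes go to whichever data file is not locked for maintenance, and only the index itself is a shared lock point:

import java.io.*;
import java.util.*;

class DualFileSketch {
    private RandomAccessFile fileA;
    private RandomAccessFile fileB;
    private volatile boolean fileALocked=false;                       // true while maintenance runs on file-a
    private Object indexLock=new Object();
    private Map<String,long[]> index=new HashMap<String,long[]>();    // id -> {file number, offset}

    DualFileSketch(File a, File b) throws IOException {
        fileA=new RandomAccessFile(a,"rw");
        fileB=new RandomAccessFile(b,"rw");
    }

    void addRow(String id, byte[] data) throws IOException {
        int fileNo=fileALocked?1:0;                                   // route around the locked file
        RandomAccessFile target=fileNo==0?fileA:fileB;
        long offset;
        synchronized(target){                                         // writes to the two files don't block each other
            offset=target.length();
            target.seek(offset);
            target.write(data);
        }
        synchronized(indexLock){                                      // only the index is shared between writers
            index.put(id,new long[]{fileNo,offset});
        }
    }

    void lockFileA(){ fileALocked=true; }                             // e.g. while compacting file-a
    void unlockFileA(){ fileALocked=false; }
}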

We will have a closer look at the design and implementation of those maintenance tasks in a later post discussing data integrity and recovery, but for now, let's end this post with a look at the performance of TupleSoup with the new file-a and file-b. These tests are run with the paged index, 50.000 rows and an index cache of 10.000 entries. Furthermore, I finally got a new battery from Apple after my old one swelled up and died, so I am now running at full speed again, making these numbers incomparable to the previous tests.

Test type    Original    Dual file
Add          0.281 ms    0.279 ms
Fetch        5.590 ms    6.317 ms
Update       5.494 ms    6.286 ms

There is a slight, but acceptable, penalty to bulk fetches and updates from a single thread, but the adds are about the same. I have not yet created a usable multithreaded benchmark, so the possible performance gain is still unknown ;-)

Sunday, January 21, 2007

TupleSoup, Part 3

The clear lesson from my last post on TupleSoup was that the index is the primary performance bottleneck in my current implementation. This was especially evident in the update operation, which needs to access the index twice if the updated row won't fit in the same space as the original. The next step for improving TupleSoup performance will thus be the implementation of an index cache. The goal is to get close to the high performance we got from the memory based index, at least for often-used rows.

The actual caching of index entries is very easy to implement: simply create a hashtable and use it to cache your entries with the row id as the key. The slightly trickier part is how to maintain the cache so it doesn't grow above a fixed maximum size. The optimal implementation (sometimes referred to as a clairvoyant replacement algorithm) would simply keep exactly the entries that we will need access to next. Unfortunately, to even come close to accurately predicting this, we would need a very thorough understanding of the system that is going to be using TupleSoup. That doesn't really fit well with the goal of TupleSoup being a generic data engine, so instead we must look for a generic algorithm which maintains the cache based on usage.

A good source of theory on cache algorithms is the literature on page replacement algorithms. These are the algorithms operating systems use to choose which memory pages to keep in memory and which to swap out when new pages are read from disk. Some of these algorithms use knowledge about how applications use memory. That is naturally not of use to us, but the ideas and concepts are very useful for designing our own caching algorithm. Unfortunately, I am going to disappoint you now, because all I will implement to begin with is a very simple least recently used algorithm. I do have some fun ideas for more advanced algorithms, but they are closely linked to value based indexes and query caching. Since we currently have neither a value based indexing mechanism nor anything even resembling a query system, this will have to wait.

But for now, let's look at the least recently used algorithm. The idea is very simple. We use a fixed-size list of index entries. Whenever we use an entry, we remove it from its current position in the list and move it to the front. When we need space to add a new entry, we remove the entry currently at the back and add the new one at the front. The idea is that entries which are used often will be "saved" and put to the front again, while entries rarely used will drop to the back and be removed.
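
As it happens, Java ships with most of what is needed for exactly this strategy: a LinkedHashMap created in access order, combined with an overridden removeEldestEntry, behaves as a least recently used map. A minimal sketch (not the TupleSoup implementation, and it would still need external synchronization for multithreaded use):

import java.util.LinkedHashMap;
import java.util.Map;

// A minimal least recently used cache. Every get/put moves the touched entry to
// the front of the internal access order; once the size limit is exceeded, the
// eldest (least recently used) entry is dropped automatically.
class LRUCache<K,V> extends LinkedHashMap<K,V> {
    private int maxEntries;

    LRUCache(int maxEntries){
        super(16,0.75f,true);        // true = order entries by access instead of insertion
        this.maxEntries=maxEntries;
    }

    protected boolean removeEldestEntry(Map.Entry<K,V> eldest){
        return size()>maxEntries;
    }
}

Used as a cache of index entries keyed by row id, this gives exactly the behavior described above; TupleSoup's own cache maintains its own list, but the effect is the same.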

If the cache is big enough to keep recurring entries in the list until they are used again, this can be very effective. Unfortunately, the larger the cache, the bigger the overhead of maintaining it. Let's have a quick look at the same benchmark as we used when we looked at indexing algorithms, but this time with a least recently used cache algorithm in use. We will use the big 50.000 row dataset with a paged index and a cache size of 10.000 index entries. This won't give us any real indication of how the cache algorithm will perform in actual usage, but it will reveal the overhead involved when adding new entries and give a slight hint of the effect in the fetch and update tests.

Test type    Original     With cache
Add          0.582 ms     0.621 ms
Fetch        18.430 ms    16.989 ms
Update       47.355 ms    23.382 ms

The add is slightly slower, but even in this very unnatural benchmark we actually end up getting a performance improvement in both fetch and update. Let's increase the size of the cache to 50.000 entries so it can hold the full dataset. This should give us an idea of the best performance we can expect if the usage of the dataset falls perfectly within the bounds of the least recently used algorithm.

Test type    Original     Full cache
Add          0.582 ms     0.588 ms
Fetch        18.430 ms    1.197 ms
Update       47.355 ms    14.557 ms

We are still far away from the performance of the memory based index, even in the fetch test, due to the overhead of the caching algorithm, but we are definitely approaching good, usable performance.

Friday, January 19, 2007

TupleSoup, Part 2

The goal of TupleSoup is to deliver a data engine that is able to store and retrieve schemaless rows with a variable length string id. For my own use, it is important that the data is contained in as few files as possible, so I have chosen a basic layout of four files (actually only three, but let's start out by assuming four): add, update, delete and index. There is not really much fun in the first three data files. If you add a row, append it to the add file. If you update a row, append it to the update file. And finally, if you delete a row, append its id to the delete file. In theory, we don't need the index to retrieve the data. To fetch a specific row, just follow these simple rules until one matches.

  1. Check the delete file. If the row's id is in it, the row has been deleted.
  2. Scan through the update file. If the row is found there, the last copy in that file will be the correct version.
  3. Scan through the add file until you find it.

This is all very, very simple to implement and it will work well for small datasets. However, to locate a row we may need to scan all the way through three files, two of which could get quite big as the dataset grows. This is where the index comes in. The index stores an entry for each row id which says in which file and at which offset the row can be found. The task of locating a row is thus reduced to looking up the id in the index and going directly to the right file at the right offset to read it.

Now that we have an index, we can also optimize the update system to waste less disk space without too big a penalty. By using the index, we can go straight to the current location of the row and write the new row in its place if it will fit. We don't even need to read the old row first to check its size if we also store the size of each row in the index. Relying fully on the index will also make the delete file redundant, saving even more space and getting closer to my goal of a very simple file layout.
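
Conceptually, an index entry then only needs to say which file a row lives in, where it starts and how much space it occupies. A rough sketch (field names are mine, not necessarily TupleSoup's):

class IndexEntry {
    String id;      // the row's variable length string id
    int file;       // which data file the row lives in
    long offset;    // where the row starts in that file
    int size;       // how many bytes the stored row takes up

    IndexEntry(String id, int file, long offset, int size){
        this.id=id;
        this.file=file;
        this.offset=offset;
        this.size=size;
    }

    // an updated row can be written in place if it fits in the old row's space
    boolean fits(int newSize){
        return newSize<=size;
    }
}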

Unfortunately, the index isn't free. When we add or modify a row, we can no longer just append it to a file; we must also maintain the index. There are many different ways to implement an index, each of which will perform well in some areas and poorly in others. For the initial version of TupleSoup, I have chosen to implement three different index types: a memory based index with a journal, a flat file index and a paged file index. The following subsections will describe each of these indexes and provide rudimentary benchmark results. The benchmark measures the average time to add, fetch and update a row in a 25.000 and a 50.000 row dataset. Be aware that these tests are run on my MacBook Pro 17" Core Duo 2.16 GHz with a 5400 rpm disk. It's currently running at a lower speed due to a dead battery, so don't take these times as an actual performance indication.

Memory Index

The memory based index keeps an entry for each row in a hashtable, making lookups insanely fast. Whenever changes to the index are made, they are appended to the index file. This functions as a journal that allows the index to recreate itself in memory after a crash. Let's look at the test results.

Test type    25.000      50.000
Add          0.068 ms    0.069 ms
Fetch        0.137 ms    0.136 ms
Update       1.512 ms    1.970 ms

It is quite obvious from these results that the memory based index experiences a very low penalty from a growing dataset. However, to use this index type we must be able to keep the full index in memory at all times. The current implementation of the index will use 52 bytes plus four times the number of characters in your ids per index entry. As an example, if we have a dataset with 50.000 rows, each with an average id length of 20 characters, then the index will take up 50.000 * (52+20*4) = 6.600.000 bytes = 6.445 kb. That may not seem like very much, but you need to be 100% sure that your dataset has no way of growing beyond the memory available to TupleSoup. If that should happen, you might not be able to start TupleSoup up to access your data anymore.

Flat index

The second index type is the flat index. It uses a single index file much in the same way the memory based index does. However, it does not cache anything. Instead it scans through its index file whenever a request is made. Furthermore, it does not simply append changes to rows as the memory index does. Instead it updates the existing entries in its file when they are modified.

Test type    25.000       50.000
Add          0.237 ms     0.257 ms
Fetch        11.816 ms    23.997 ms
Update       30.014 ms    60.636 ms

These numbers are pretty much as expected. Adds are very quick and are not really affected by the size of the dataset, since we are still just appending to a single file. Fetches are almost 100 times slower than with the memory based index and become proportionally slower as the dataset grows. Updates are only around 20 times slower than with the memory based index, because a big part of that task is not index related, but they also suffer immensely from a growing dataset.

Paged Index

The problem with the flat index is that we have to scan through the full index to find the entries at the end of it. A simple approach to fix this is to divide the index file into pages of entries along with a method that can quickly help us find the right page for a given row id.

The paged index does exactly this. It uses a single file divided into pages of entries. Each page has two pointers: one to the next page and one to a page whose entries all have id hashcodes lower than that of the first entry in the page itself. To scan for the entry for a row id in this index, follow these simple rules, starting at the first page (a small sketch of the scan follows the list).

  1. If the hashcode for the id you are looking for is lower than that of the first entry in this page, start again with the page marked as lower for this page.
  2. If it's equal to or higher, scan through this page. If the entry is not found here, start again with the page marked as next.
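
Here is a small in-memory sketch of that traversal (Page is purely illustrative, not TupleSoup's on-disk format, and it reuses the IndexEntry sketch from earlier):

import java.util.*;

class Page {
    int firstHashcode;                                   // hashcode of the first entry in this page
    List<IndexEntry> entries=new ArrayList<IndexEntry>();
    Page lower;                                          // page for ids with a hashcode below firstHashcode
    Page next;                                           // page to continue scanning in otherwise

    static IndexEntry scan(Page page, String id){
        int hash=id.hashCode();
        while(page!=null){
            if(hash<page.firstHashcode){
                page=page.lower;                         // rule 1: follow the lower pointer
            }else{
                for(IndexEntry e:page.entries){          // rule 2: scan through this page
                    if(e.id.equals(id)) return e;
                }
                page=page.next;                          // not here, continue with the next page
            }
        }
        return null;                                     // no entry for this id
    }
}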

This index gives us a simple tree structure which can help us locate the needed record faster. However, to add a new entry, it is no longer enough to just append it; you must scan through the tree structure to locate the first valid page which has enough free space for the new entry. Also, there is no guarantee that this tree will be balanced. In fact, if you keep adding entries with a lower hashcode than the last, they will end up creating a new page for each entry, giving you a huge, slow index. However, if entries are added in a somewhat random order with regard to hashcodes, chances are that it won't be that bad. Let's have a look at how the paged index compares to the flat index performance-wise.

Test type    25.000       50.000
Add          0.582 ms     0.582 ms
Fetch        13.196 ms    18.430 ms
Update       34.015 ms    47.355 ms

As expected, adds are slower than with the flat index. They do, however, seem to stay stable even as the index grows in size. The fetch is initially slower than the flat index for a small dataset, but is less affected by a growing dataset, beating the flat index in the test with 50.000 tuples. The same goes for the update test.

B-Tree Index

The primary problem with the paged index is the risk of generating a highly unbalanced tree. To improve upon this, we would have to use a self balancing structure such as a b-tree. Under normal circumstances, this wouldn't be that hard to implement, but one of the key design goals for TupleSoup is to have variable length ids. This means that each node in the tree could take up a unique number of bytes on disk, making the task of re-balancing the tree much less fun.

Unfortunately, I'm feeling lazy and the current indexes will quite happily serve my needs for now, so the b-tree will have to wait for another day :-)