Syntactic Sirup: 2007

Friday, October 26, 2007

TaskPaper

I recently started using a small app for the mac called TaskPaper. Its basically a small todo list/outliner app, but with a big difference that sets it apart from the more complex competing products. Its file format is just plain text.... in fact, its so plain that it's probably very similar to what you would use if you maintained the same kind of list using a plain text editor. Let me show you a little sample from my own todo list at work.

Software update service:
- Add public/private key sign. to connections @code
- Add checksum after download and update @code
- Add temporary file cleanup @code
- Fix load issue with server side sockets @bug
- Write readme file @doc

Thats really all there is to it. Having the file format this simple makes it possible to share your task lists using code repositories such as subversion with all the same benefits as regular source code files. In other words, its the perfect tool for maintaining todo lists in your code projects :-)

I could keep on blabbering about how nice and simple the interface is, or how cool the TextMate bundle for TaskPaper files is, but really... download the trial version and give it a spin. Remember, if you end up not buying the app when your trial is over, all of your work will be perfectly accessible in the plain text files TaskPaper stores its data in.

Friday, September 7, 2007

TupleSoup, Part 6

I have just uploaded a new release of TupleSoup. Unless you have multiple threads accessing your data, there is no real benefit for you in this release. In fact, you are even required to change a few things in your source... sorry :-( ...however, the possibilities these changes open for are really exciting, which will be what the rest of this post is about.

The primary change since the last version has been that the class Table has been renamed to DualFileTable and Table has instead been recreated as an interface. The idea behind this is that it will be possible for other parts of the system to work with any kind of table storage as long as they adhere to the Table interface. To adapt your code, all you need to do is change your instantiations to use DualFileTable, instead of Table as shown in the following sample.

Table table=new DualFileTable("test","./");

The first new type of table is also included in this release, its called HashedTable and is basically a hash of 16 separate DualFileTable objects. Whenever you insert a new row, it uses the row id to decided which of the 16 tables it should be stored in. In an ideal situation, this allows up to 16 threads to write new rows to disk at the same time without locking (remember, even in an ideal situation most of these threads would still have to fight for io on a lower level). However, the penalty is that any thread who wants to scan through the entire set of data in the table, will now have to scan through 16 sets of data instead of one. Even though these will be smaller than the one large, this will bring in a substantial overhead, so only use this new table for data storage that rarely do full table scans.

This might not sound too exciting if you are not really doing massively parallel applications, however this whole refactoring has opened up great ways for me to add future functionality. One of the primary features I have been trying to find a good solution for is value indexing. That is, to create an index on a table that indexes all values of a single key on all rows. This will make it really snappy to select a set of rows based on the values of that key in the rows. With this new design, this type of indexing can be designed as an object that implements the Table interface and simply wraps around another table, injecting the indexing functionality where it is needed.

Thats about it for this post, I hope to soon follow up with a post that provides a longer full sample of TupleSoup usage.

Monday, August 27, 2007

TupleSoup Public Release

Tuplesoup is a small easy to use Java based framework for storing and retrieving simple hashes. If you are interested in the design and implementation choices I made, have a look at the following blog posts I made while working on the project.

Part 1, Introduction
Part 2, Record index
Part 3, Index caching
Part 4, First refactoring
Part 5, Sorting

Since the last post in this series, I have done some further refactoring as well as added a set of statistic counters that can be polled to look for performance issues in usage patterns. The project is far from done, but I feel that it has reached a maturity that might make it useful for other developers. We have currently been using Tuplesoup for projects in our production environment for more than 3 months now without any stability issues. That being said, use at your own risk!

Let me end this post with a little bit of source code to show you how to actually use Tuplesoup in your own code. This very short completely useless example creates a table, adds a row and retrieves the row again. Hopefully I will soon find the time to write a more detailed piece of code showing some real world usage.

import com.solidosystems.tuplesoup.core.*;

Table table=new Table("test","./");

Row row=new Row("1");
row.put("username","kasperjj");
row.put("floatNumber",3.141592);
table.addRow(row);

row=table.getRow("1");
String username=row.getString("username");

Tuplesoup is available through sourceforge and has been released under the BSD license.

Friday, August 24, 2007

GTAC 07

I'm currently sitting in google's offices in New York for a small two day invitation-only conference called Google Test Automation Conference 07. This year, the primary theme seems to be automated testing of web interfaces using either WebDriver or Selenium.

The overall content of the conference so far has been very good, except for a few really boring sessions... however, since I'm primarily a developer and not directly within qa, I might just be too far away from the target audience for those sessions. This morning started off with a really interesting talk about using domain specific languages to write Selenium tests. It was very well delivered, and the content was definitely interesting, but at the same time I just couldn't help but notice how most of their samples in their dsl simply brought the interface of Selenium closer to the native interface for WebDriver. I haven't actually tried using either of the two systems yet, so this is purely prejudiced speculation from my side :-)

Although the rest of the day is filled with more talks about Selenium which all sound really interesting, I'm really starting to look forward to a lightning talk at the end of the day called "Selenium vs WebDriver steel cage knife fight". It will be featuring the authors of the two projects and I intend to be right up there by the edge of the cage ready to throw bacon at them as the fight begins!

Thursday, August 23, 2007

Lunch at GTAC 07

Ok... I take it all back, lunch at google today was so incredibly yummy that I completely forgive the missing coffee this morning. Seriously, even after reading blog post after blog post of people raving about the food at google, I was in no way prepared for the actual deliciousness of it :-)

Now I just gotta find a way to sneak back into the offices around lunchtime every day... or maybe I could permanently occupy a stall in the bathroom and just come out once a day for lunch...

Caffeine free GTAC 07

I just arrived at googles offices in new york for a conference on testing automation. The invitation said breakfast included so I slept in late and went directly from my bed to the L train towards manhattan. The breakfast is fine, except for one important element... no more coffee... seriously, nothing but decaf :-(

How can you start a conference at 8:30 in the morning and run out of coffee at 8:35 ???

Wednesday, July 4, 2007

Netstat Sucks!!!

Well.. at least the following version of netstat sucks:


netstat 1.42 (2001-04-15)
Fred Baumgarten, Alan Cox, Bernd Eckenfels, Phil Blundell, Tuan Hoang and others

At first glimpse, netstat works just fine. In fact, if you have less than 12 digits in the ip addresses you are looking at, you will probably never notice a problem. Unfortunately, the servers I'm usually working on has 12 digits. Have a look at the following output line from netstat.


tcp 0 0 ::ffff:217.116.236.191:25 ::ffff:217.116.236.18:40280 ESTABLISHED

It looks like a connection from 217.116.236.18 to 217.116.236.191. Unfortunately, thats not the case. In this case, it was a connection from .185. So what happened? Well, somebody in the group of people who wrote netstat, decided that it was more important to have nice and pretty looking columns than correct data. So they actually crop the ip address if the ip and port data doesn't fit the column width.

I really can't see any good reason for doing this.... netstat sucks!

Monday, February 26, 2007

Minimizing Id Size While Maximizing Readability

The last few posts have primarily been me bitching about things that annoy me, so I thought I would throw in a little happy-go-lucky constructive post with almost no bitching at all.

I mentioned in my last post about pretty urls that I always attempt to create my ids so they avoid using characters that are easily confused, such as I, 1, l, O and 0. Doing so is actually very easy and in this little post I will show you how I do it in Ravenous and even provide you with a ready to use implementation in Java.

The basic idea is that we convert integer ids into a string representation using only the letters and numbers that are not easily confused. For my implementation, I chose to use the following 31 characters: 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, H, J, K, M, N, P, Q, R, S, T, U, V, W, X, Y, Z.

A positive side benefit from this, is that we will also be moving from a base 10 system to a base 31, which means that we in many cases will be seriously reducing the number of characters used for an id.

So lets start talking about the implementation by looking at the way we would convert a number into a string. First, lets build an array of chars containing our 31 character set.

char[] digits={'2', '3', '4', '5', '6',
               '7', '8', '9', 'A', 'B',
               'C', 'D', 'E', 'F', 'G',
               'H', 'J', 'K', 'M', 'N',
               'P', 'Q', 'R', 'S', 'T',
               'U', 'V', 'W', 'X', 'Y',
               'Z'};

We use this array to directly lookup the character a number between 0 and 30 should be converted into. To convert a full number into a string, we simply keep dividing it by 31 and each time adding the character representing the remainder of the division to the string. When the remaining number is 0 we are done.

String convertToString(long num){
    StringBuilder buf=new StringBuilder();
    while(num>0){
        buf.append(digits[num % 31]);
        num=num/31;
    }
    return buf.toString();
}

Easy right? Converting a string back into a number is just as easy. This time simply run through each character, convert the single character back into a number based on its position in our translation array and multiplies that with 31 raised to the power of the current character position. Sound confusing? Try imagining if we where working with a 10 based system. The value of the first character would be multiplied by 10 ^ 0 = 1, the next by 10 ^ 1 = 10, the next by 10 ^ 2 = 100 and so forth. Thus multiplying each character by the value of its position. This is the exact same thing we are doing in our base 31 system. So lets look at the code. I'm assuming that we have a method called charToInt that will convert a base 31 character into its int value from our table.

long convertToLong(String str){
    str=str.toUpperCase().trim();
    long num=0;
    for(int i=0;i<str.length();i++){
        char c=str.charAt(i);
        num+=charToInt(c)*(Math.pow(31,i));
    }
    return num;
}

Easy right? So how about that charToInt method. Well, there are several strategies you can use for implementing it, but honestly the simplest is just to create a switch statement that utilizes the numeric value of the char to switch into a statement that returns the right value.

int charToInt(char c){
    switch(c){
        case '2':return 0;
        case '3':return 1;
        ...
    }
    return 0;
}

I have omitted the last 29 cases, but you should be getting the point by now :-). The last return is purely there to satisfy the Java compiler and to provide a semi rational behavior when (not if) someone attempts to feed it an invalid id string. If you want to optimize this further, you could create an array just like the one converting from numbers to characters. All you need to do is find the largest numeric value of those 31 characters, create an array of that size and fill in the right values.

So what do these ids look like and how much space will we save. Lets end this article with a set of sample ids, the strings they would be converted into and a rounded down percentage for how much space we are saving.

Integer	String	Saved
1	3	0 %
12	E	50 %
123	Z5	33 %
1234	UA3	25 %
12345	9VE	20 %
123456	GG66	33 %
1234567	SPFC3	28 %
12345678	QQEDF	37 %
123456789	425QB6	33 %

Friday, February 23, 2007

The Rise of the URL Fascist Regime

Some time within the past couple of years, something really strange happened. Most of my web developing friends turned into a bunch of very vocal url fascists.

Urls are suddenly divided into two sharply divided groups. The urls that fit their views and the urls that do not. If your urls do not fit into the group supported by the fascist leaders, it doesn't matter what the reasons are for your url design or how great your system is. Your urls suck, thus your system sucks and you as a developer are at best hopelessly misguided.

So what is this all about?

The people I'm referring to are not graphic designers who do a bit of web development. The leaders of this regime are all programmers who feel right at home in emacs and vi. The natural assumption would thus be that this is one of the regular crusades for following standards, but thats not it at all... this is not about making urls more technically functional. This is about making the urls look pretty, thats actually what the whole topic is called - pretty urls. Not readable urls, not memorizable urls, not semantic urls, not understandable urls, not hackable urls, not short urls.... but pretty urls.

I decided to ask around a bit, to see if I could better understand what exactly divides the ugly from the pretty. It turns out that there are varying opinions about what the ultimate pretty url looks like, but everybody seems to agree on what an ugly url looks like. One of my friends gave me the following example.

http://foo.bar/cmssystem.php?page=foo&object=bar&object_id=123123&PHPSESSIONID=123123123123123&showOnlyFemales=YESPLEASE&username=FUNKY&password=MYASS&submit.x=33&submit.y=453

Ok, even I can follow that this is not pretty. I would also say that its not readable, memorizable or understandable either. However, I do think it has some good functional sides which I will return to later.

So, I tried to create some different examples of urls which I passed around to get a prettiness rating and I finally seemed to figure out the primary difference between a pretty and an ugly URL. Take a look at the following two URLs.

http://foo.bar/profile?id=12345
http://foo.bar/profile/12345/

The ruling from the fascist regime is that the first url is ugly and the second one is pretty. My first instinct was to conclude that this was a question of keeping the urls as short as possible, but it turns out that thats not it at all. The following url combines the two by adding the name of the parameter as a part of the url path.

http://foo.bar/profile/id/12345/

Almost everybody agreed that this was the prettiest of them all. Since this is also the longest of them all, that rules out the short is better theory. As can easily be deduced from the fact that the fascist leaders didn't mind the actual word id being part of the path, its not a question of exposing the names of the variables to the users. Its all a question of keeping the ? = and & characters out of the URL. Apparently these are really ugly characters.

Here are a few more examples of pretty urls given to me by people from the new regime. Notice how they are not designed to be short. Instead they do everything possible to keep all parameters in the path.

http://foo.bar/2007/feb/10/danielwinsthelottery/
http://foo.bar/types/hatchbacks/ford/fiesta/

Notice another thing here. None of these refer to data using an id. They use nearly readable names instead. It should be mentioned though that thats not a fixed rule within the pretty url regime, some people still use ids.

So, to sum it up: Pretty urls must do everything possible to avoid the regular query parameters and if it makes sense, they should be humanly readable.

Unfortunately, none of the believers in the new pretty url regime can actually give an explanation towards the evilness of the query parameters. Most people I asked could not come up with anything better than "its ugly". In fact, most of these people can't even give a good precise definition of a pretty url and even less an explanation towards why its good. In fact, the following three statements where the only direct statements I could get out of any of them (it should be noted that two of these statements where given to me in danish, so these are my translations).

Each element in the path should be a further refinement into the dataset.
The overall path should be humanly readable and understandable.
Query parameters should not be used for navigation.

Its not that I completely disagree with these ideas, but it seems that everybody has gotten so focused on the new and wonderful world made possible by brand new technology such as mod_rewrite (been there since 1996) that everything must now be done in this way or its just way too web1.0 to be acceptable.

So why am I making a big fuzz about this and calling half of my friends fascists? First of all, I'm calling them fascists because they consider the system to be more important than the individuals. That is pretty urls are more important to them than the actual websites. Secondly, I'm making a big fuzz because query parameters actually do have some good things going that shouldn't be left out just because you think a question mark or an equal sign is ugly. As long as you use proper parameter, they let the user know what the individual "magic" parameters actually are.

I'm not saying that you shouldn't put anything in the path. I am definitely all for the move from using the path to specify code into using the path to specify data. But religiously staying away from query parameters because they are ugly is just plain stupid. One of the uses for query parameters that I really like, is for view manipulation uses. For example to hold the offset and sort order of a table of data. Lets look at an example.

http://foo.bar/users?offset=300&sortby=username

You could put these parameters into the path. But by keeping them as query parameters it is clearly visibly to the user what they are and how they change as she works with the table. For more advanced users, this will also make it easier to manipulate the url directly.

So let me end this looong rant with the guidelines I use when designing urls for web applications.

The path should be readable and specify data on code.
Things that modify the view of the data should be placed in query parameters.
Avoid using characters in id's that are easily confused such as 0, O, I, 1, l.

Thursday, February 22, 2007

Regular Expressions and String Literals

As previously stated on this blog, I really love Java. I am also very much in love with regular expressions. Unfortunately, these two girls just don't get along. Sure, they will say they do, but whenever I try to get them to cooperate the fighting begins.

If you don't like Java, you will probably use this as an opportunity to say "told you so!". If you like Java but haven't worked much with regular expressions in Java, you will probably point me to the big and wonderful API for working with regular expressions and tell me to go read the documentation... surely, this API must be as nice as the rest of the Java APIs. If you both like Java and have worked seriously with regular expressions in Java, you will probably be nodding right now and saying "Ah yes... string literals".

See the problem with Java and regular expressions, isn't the functionality of its regular expression library. Its the fact that you don't have literals in the language to enter regular expressions that makes it a pain instead of a pleasure.

Let me show you an example. String objects in Java have a nice method to do replacements based on regular expressions. It takes two Strings as parameters. First the regular expression to search for, then another string to replace with. The replacement string can include $ characters to reference groups caught by the regular expression. So the following call would replace all a's with o's.

String str="Foo Bar";
str=str.replaceAll("a","o");

Simple right? So what happens when we want to replace a \ with \\ instead of an a with an o? Lets look at the regular expression part. First we need to pad the \ into a \\ to satisfy the regular expression as it sees the \ as a special character. Then we need to do the same for the Java string literal that we are using to enter it and we must naturally do that for both characters, so we end up with "\\\\" to match a single \.... and how about the replacement part? Well, the same thing applies there, except that we end up with twice the number of characters giving us the following two parameters for the replaceAll method.

str=str.replaceAll("\\\\","\\\\\\\\\");

Take a close look at that. Can you actually tell that there are exactly eight characters without using any assistance such as a finger or your mouse cursor to count them? If so, you need to reevaluate your own counting abilities, because there where actually nine characters - the last one was just thrown in to show how hard it actually is to count that many identical characters quickly. Lets take a look at the correct version for a second.

str=str.replaceAll("\\\\","\\\\\\\\");

It still looks silly doesn't it? Remember that this is just a very simple expression trying to replace a single character with two of the same kind. For this very simple example we could actually have grouped the match and referenced it as follows.

str=str.replaceAll("(\\\\)","$1$1");

However, once you start doing any serious regular expression work that includes characters that need escaping for either the regular expression, the string literal or both things get ugly very quickly. The sad thing is that it really doesn't have to be this hard, there is really no reason why we couldn't have regular expression literals in Java... perhaps even with a customizable escape character if we were really lucky.

Monday, January 29, 2007

TupleSoup, Part 5

I'm currently sitting in Heathrow airport, waiting for my connecting flight back to Copenhagen. I still have a little over 12 hours to go. If you have been through Heathrow recently, you probably have a good feel for the pain I'm suffering right now. Finding a decent seat with a power outlet nearby is almost impossible. Finding a proper meal is even harder... If only Starbucks had power outlets near their seats, I would at least be able to enjoy a nice cup of coffee and some good snacks (even though I would also go broke within a couple of hours). So, now that I'm sitting here with plenty of time and nothing important to do, why not continue my series on TupleSoup.

I have now implemented simple data data engine which allows you to store, update and retrieve data. I do have a simple RowMatcher which can be used to select a subset of rows in a table based on the values of these data (think where clause in SQL) but I have no other functionality to work with a set of data instead of single rows.

If we look at TupleSoup from a database users point of view, the next logical step will be to add functionality for working with rows using sort, join, aggregate and selection filters. The aggregation and selection filters are pretty straight forward to implement, but the sort and join functionality will take a little more work. Not because they are inherently more complex features, but because one of the nice features in TupleSoup is that we can fetch result sets that are much larger than available memory. This is accomplished by returning an iterator which reads rows as they are requested. Implementing an aggregation or selection filter that continuos to treat the data as a stream is straight forward, but implementing sort and join functionality without holding the entire dataset in memory is far harder.

The rest of this post will be spend discussing how to sort large datasets without the need to hold the whole set in memory. The basic idea is that we split the task up into smaller tasks and keep the partial results in disk based buffers. So lets first start out looking at a buffer for row data.

Row Buffer

The purpose of the row buffer is to store a set of rows and deliver them again later as an iterator in the same style as if we were accessing a table directly. The simplest implementation would simply hold the rows using a Java List and return its build in iterator when the data is needed again, but since we are going to be using this buffer specifically to avoid holding the full data set in memory, that won't work. Instead, we will use a single flat file to store the data in. Since we are only required to deliver the rows in sequence, we have no need for any kind of index functionality like the regular table. The following table shows the performance of the row buffer compared to the regular table.

Test type	Table	RowBuffer
Add row	0.296 ms	0.023 ms
Fetch row	0.177 ms	0.012 ms

It looks like the row buffer is around ten times faster than the regular table. Naturally, thats not fast enough for a greedy bunch of software developers like us, so I have added a cache to it which uses a regular Java list to store all rows until a given cache size has been used up. In that way, you can set the maximum acceptable memory usage for your query and the row buffers will ensure to keep things as fast as possible within that limit.

Sorting

If you have read even a little bit about algorithms you have probably come across several sort algorithms such as bubble sort, quick sort, heap sort and radix sort. If you haven't read anything about sort algorithms, you should at least go check out bubble and heap sort.

Under normal circumstances, you would choose your sort algorithm based on its big O complexity and memory usage, but as mentioned in the introduction of this post, we can't rely on the data set being able to fit into memory in TupleSoup. Our goal is thus to choose a sort algorithm that will allow us to sort a huge dataset with a minimum of memory requirements. One algorithm that works particularly well for this is merge sort. The idea behind merge sort is that its quite easy to merge two lists that are already sorted by themselves. In the simplest implementation, you recursively split your dataset into lists until you end up with a set of lists that hold just one element. Then you start merging the lists back together until you end up with one sorted list.

Instead of dividing your initial dataset all the way down to one element lists, you can divide it into a set of lists which are each just small enough to be handled in memory, and then use another sort algorithm to sort these in memory before merging them together. This is exactly how the merge sort in TupleSoup has been implemented. You give it a maximum memory usage which it will split between cache usage for its row buffers and for in memory sorting. Then it uses the sort functionality available in Java's collection to sort small chunks in memory at a time, which it then writes out to row buffers. Once the whole dataset has been presorted in this way, it uses merge sort to merge the resulting row buffers into one final sorted rowbuffer.

The following table shows the performance of the merge sort algorithm in TupleSoup. I have added the sort functionality in Java's collection framework, which also uses merge sort, but keeps everything in memory. Each test sorts 100000 elements which in total takes up a bit more than 17 mb of space when held in memory.

Test type	Time
Collections.sort	328 ms
MergeSort 4 mb	10260 ms
MergeSort 2 mb	11729 ms
MergeSort 1 mb	15334 ms
MergeSort 512 kb	17848 ms

There is an obvious benefit from having a larger cache size, but even with a good 4 mb of cache we are still much slower than the pure memory sort. One possible improvement to better utilize the cache size, is to only keep the actual values being sorted along with the id for each row. Sorting that reduced dataset and then joining it back up with the full dataset will allow us to sort far bigger chunks in memory. We will return to this in the next post which will cover join operations :-)

Wednesday, January 24, 2007

TupleSoup, Part 4

I started the TupleSoup implementation without using an index to keep things simple so I could get a working code base as soon as possible. Then I added the index for faster random access operations. It turned out that full fetches of the whole dataset using the index is not that much slower than scanning the files without it especially with an index cache. Relying on the index for all operations removed the need for the delete file and thats just about where the implementation was left by the time I wrote part 2 of this series.

However, there is really no need for an update file either. A single data file and an index would serve my needs quite well.... except, having two files brings us a new set of possibilities especially if we start viewing them as equals functionality wise. So from now on TupleSoup will store three files for each table: file-a, file-b and an index.

The most obvious gain from this, is that it will now possible able to add and update rows in parallel, locking only on the index. This will in theory allow us to perform better when used in a multithreaded environment. But the far more exciting result from this change, is that we can now lock one file for changes without affecting the functionality of the system. If we lock file-a for changes, all adds and updates simply go to file-b. This will allow us to do maintenance work on the data files while the system is running. One such task could be to compact the data files to reclaim space lost when rows are deleted or updated. Another could be to rearrange rows to match an external data index, improving performance when reading rows based on that index.

We will have a closer look at the design and implementation of those maintenance tasks in a later post discussing data integrity and recovery, but for now, lets end this post with a look at the performance of TupleSoup with the new file-a and b. These tests are run with the paged index, 50.000 rows and an index cache of 10.000 entries. Furthermore, I finally got a new battery from Apple after my old one swelled up and died, so I am now running at full speed again making these numbers incomparable to the previous tests.

Test type	Original	Dual file
Add	0.281 ms	0.279 ms
Fetch	5.590 ms	6.317 ms
Update	5.494 ms	6.286 ms

There is a slight, but acceptable, penalty to bulk fetches and updates from a single thread, but the adds are about the same. I have not yet created a usable multithreaded benchmark, so the possible performance gain is still unknown ;-)

Sunday, January 21, 2007

TupleSoup, Part 3

The clear lesson from my last post on TupleSoup was that the index is the primary performance bottleneck in my current implementation. This was especially evident in the update operation that needs to access the index twice if the updated row won't fit in the same space as the original. The next step for improving TupleSoup performance will thus be the implementation of an index cache. The goal is to get close to the same high performance we got from the memory based index for often used rows.

The actual caching of index entries is very easy to implement. Simply create a hashtable and use it to cache your entries using the row id as the key. The slightly trickier part is how to maintain the index so it doesn't grow above a fixed maximum size. The optimal implementation (some times referred to as a clairvoyant replacement algorithm) would simply keep the things in the index that we will need access to next. Unfortunately, to even come close to accurately predicting this, we would need a very thorough understanding of the system that is going to be using TupleSoup. That doesn't really fit well with the goal of TupleSoup being a generic data engine, so instead we must look for a generic algorithm which maintains the cache based on usage.

A good source for theory on cache algorithms is theory for page replacement algorithms. These are algorithms that are used by operating systems to choose which memory pages to keep in memory and which to swap out when new pages are read from disk. Some of these algorithms uses knowledge about how applications use memory. This is naturally not of use to us, but the ideas and concepts are very useful for designing our own caching algorithm. Unfortunately, I am going to disappoint you now, because all I will implement to begin with is a very simple least recently used algorithm. I do have some fun ideas for more advanced algorithms, but they are closely linked to value based indexes and query caching. Since we currently have neither a value based indexing mechanism or anything even resembling a query system, this will have to wait.

But for now, lets look at the least recently used algorithm. The idea is very simple. We use a fixed sized list of index entries. Whenever we use an entry, we remove it from its current position in the list and move it to the front. When we need space to add a new entry, we remove the entry currently in the back and add the new one in the front. The idea is that entries which are used often, will be "saved" and put to the front again, while entries rarely used will drop to the back and be removed.

If the cache is big enough to keep recurring entries in the list until they are used again, this can be very effective. Unfortunately, the larger the cache, the bigger the overhead will be from maintaining the cache. Lets have a quick look at the same benchmark as we used when we looked at indexing algorithms, but this time with a least recently used cache algorithm in use. We will use the big 50.000 rows dataset with a paged index and a cache size of 10.000 index entries. This wont give us any real benchmark on how the cache algorithm will perform for actual usage, but it will reveal the overhead involved when adding new entries and a slight hint of the effect in the fetch and update tests.

Test type	Original	With cache
Add	0.582 ms	0.621 ms
Fetch	18.430 ms	16.989 ms
Update	47.355 ms	23.382 ms

The add is slightly slower, but even in this very unnatural benchmark we actually end up getting a performance improvement in both fetch and update. Lets increase the size of the cache to 50.000 entries so it can hold the full dataset. This should give us an idea of the best performance we can expect from it if the usage of the dataset falls perfectly within the bounds of the least recently used algorithm.

Test type	Original	Full cache
Add	0.582 ms	0.588 ms
Fetch	18.430 ms	1.197 ms
Update	47.355 ms	14.557 ms

We are still far away from the performance of the memory based index, even in the fetch test due to the overhead of the caching algorithm, but we are definitely approaching a good usable performance.

Friday, January 19, 2007

TupleSoup, Part 2

The goal of TupleSoup is to deliver a data engine that is able to store and retrieve schemaless rows with a variable length string id. For my own use, it is important that the data is contained in as few files as possible, I have thus chosen a basic layout of four files (actually only three, but lets start out by assuming four): add, update, delete and index. There is not really much fun in the first three data files. If you add a row, append it to the add file. If you update a row append it to the update file and finally, if you delete a row append its id to the delete file. In theory, we don't need the index to retrieve the data. To fetch a specific row just follow these simple rules until one matches.

Check the delete file. If the rows id is in that, it has been deleted.
Scan through the update file. If the row is found there, the last copy in that file will be the correct version.
Scan through the add file until you find it.

This is all very very simple to implement and it will work well for small datasets. However to locate a row we may need to scan all the way through three files, two of which could be quite big as the dataset grows. This is where the index comes to use. The index will store an entry for each row id which says in which file and at which offset the row can be found. The task of locating a row has thus been reduced to looking up the id in the index and going directly to the right file at the right offset to read it.

Now that we have an index, we can also optimize the update system to achieve a lower waste of disk space without too big a penalty. By using the index, we can go straight to the current location of the row and write the new row in its space if it will fit. We don't even need to check the size of the old row first if we add the size of the row to the index. Relying fully on an index will also make the delete file redundant saving even more space and getting closer to my goal of a very simple file layout.

Unfortunately, the index isn't free. When we add or modify a row, we can no longer just append it to a file, we must also maintain the index. There are many different ways to implement an index, each of which will perform well in some areas and poor in others. For the initial version of TupleSoup, I have chosen to implement three different index types: a memory based with a journal, a flat file and a paged file. The following sub sections will describe each of these indexes and provide rudimentary benchmark results. The benchmark will measure the average time to add, fetch and update a row in a 25.000 and 50.000 row dataset. Be aware, that these tests are run on my macbook pro 17" core duo 2.16ghz with a 5400rpm disk. Its currently running at a lower speed due to a dead battery, so don't take these times for an actual performance indication.

Memory Index

The memory based index keeps an entry for each row in a hashtable making lookups insanely fast. Whenever changes to the index are made, they are appended to the index file. This will function as a journal that will allow the index to recreate itself in memory after a crash. Lets look at the test results.

Test type	25.000	50.000
Add	0.068 ms	0.069 ms
Fetch	0.137 ms	0.136 ms
Update	1.512 ms	1.970 ms

It is quite obvious from these results that the memory based index experiences a very low penalty from a growing dataset. However, to use this index type we must be able to keep the full index in memory at all times. The current implementation of the index will use 52 bytes plus four times the number of characters in your ids pr index entry. As an example, if we have a dataset with 50.000 rows, each with a an average id length of 20 characters, then the index will take up 50.000 * (52+20*4) = 6.600.000 bytes = 6.445 kb. That may not seem like very much, but you need to be 100% sure that your dataset has no way of growing beyond the memory available to TupleSoup. If that should happen, you might not be able to start TupleSoup up to access your data any more.

Flat index

The second index type is the flat index. It uses a single index file much in the same way the memory based index does. However, it does not cache anything. Instead it scans through its index file whenever a request is made. Furthermore, it does not simply append changes to rows as the memory index does. Instead it updates the existing entries in its file when they are modified.

Test type	25.000	50.000
Add	0.237 ms	0.257 ms
Fetch	11.816 ms	23.997 ms
Update	30.014 ms	60.636 ms

These numbers are pretty much as expected. Add's are very quick and are not really affected by the size of the dataset since we are still just appending to a single file. Fetches are almost 100 times slower than the memory based index and become proportionally slower as the dataset grows. Updates are only around 20 times slower than the memory based index due to a big part of the task not being index related, but they also suffer immensely from a growing dataset.

Paged Index

The problem with the flat index is that we have to scan through the full index to find the entries at the end of it. A simple approach to fix this is to divide the index file into pages of entries along with a method that can quickly help us find the right page for a given row id.

The paged index does exactly this. It uses a single file divided into pages of entries. Each page has two pointers, one to the next page and one to a page where the hashcode for the ids are lower than the first entry in itself. To scan for an entry for a row id in this index, follow these simple rules. Starting at the first page.

If the hashcode for the id you are looking for is lower than the first entry in this page, start again with the page marked as lower for this page.
If its equal to or higher, scan through this page. If the entry is not found here, start again with the page marked as next.

This index gives us a simple tree structure which can help us locate the needed record faster. However, to add a new entry, it is no longer enough to just append it, you must scan through the tree structure to locate the first valid page which has enough free space for the new entry. Also, there is no guarantee that this tree will be balanced. In fact, if you keep adding entries with a lower hashcode than the last, they will end up creating a new page for each entry, giving you a huge slow index. However, if entries are added in a somewhat random order with regards to hashcodes, chances are that it won't be that bad. Lets have a look at how the paged index compares to the flat index performance wise.

Test type	25.000	50.000
Add	0.582 ms	0.582 ms
Fetch	13.196 ms	18.430 ms
Update	34.015 ms	47.355 ms

As expected, add's are slower than with the flat index. They do however seem somewhat stable even when the index grows in size. The fetch is initially slower than the flat index for a small dataset, but is less affected by a growing dataset, beating the flat index in the test with 50.000 tuples. The same goes for the update test.

B-Tree Index

The primary problem with the paged index is the risk of generating a highly unbalanced tree. To improve upon this, we would have to use a self balancing structure such as a b-tree. Under normal circumstances, this wouldn't be that hard to implement, but one of the key design goals for TupleSoup is to have variable length ids. This means that each node in the tree could take up a unique number of bytes on disk, making the task of re-balancing the tree much less fun.

Unfortunately, I'm feeling lazy and the current indexes will quite happily serve my needs for now, so the b-tree will have to wait for another day :-)

Saturday, January 13, 2007

TupleSoup, Part 1

I am in desperate need of a small embedable database for a project at work. It would be far too easy to just pick one among the many excellent open source Java databases already available (H2, Derby, tinySQL, Mckoi, QED etc.), so naturally I am writing my own. As I work on this project, I intend to regularly post about my design and implementation choices on this blog. The complete code will be posted on sourceforge as soon as it is usable to anyone but me in any way.

If the design and implementation of a small database sounds appealing to you, why not follow me here while I do my very best to document the process in the most entertaining way possible... ok, that might be overstating it a bit - but, if you haven't written one yourself yet, you might want to read along to learn a bit from my mistakes and if you have already written one or two, you might want to stick around to get a good laugh out of those same mistakes. So let me start out by introducing my vision for what this little database should be able to do.

Last year I read Andrew Hitchcock's excellent post about the talk Jeff Dean gave at the University of Washington on Google Bigtable. It was later followed up with an official article (If you have any interest in alternative databases at all, I highly recommend that you read the article, or at least Andrew Hitchcock's blog post). That article really got me thinking about the possibilities of using a simple schemaless database with variable length strings as id's for records for my daily work. I had previously really enjoyed implementing a simple schemaless database, which I have long been thinking about rewriting. So thats where I am at now. TupleSoup will be the core data engine in a new database I am going to start writing (I hope I will also be able to open source that database, but I am currently discussing that with my employer). It will be schemaless and rows will be stored and fetched based on variable length string ids.

I could go on ranting for hours about how I would like to implement this data engine and how I would like to be able to use it, but instead I'm going to stop blabbering now and end this post with a few lines of code demonstrating how I imagine that the finished product would be used.

Table tbl=new Table("datafile",".");

Row row=new Row("kapper");
row.put("name","Kasper J. Jeppesen");
row.put("logins",0);
tbl.addRow(row);

row=tbl.getRow("kapper");
row.put("login",row.getInt("login")+1);
tbl.updateRow(row);

Tuesday, January 9, 2007

Why I still love Java

I do most of my large scale software development in Java and all of my scripting in Ruby. Its a wonderful combination that keeps me happy and productive.

Mentioning Ruby is greeted by most other developers with smiling nods of approval - its young, trendy and easy to love - however, mentioning that I use Java as my main language gets very mixed reactions. Everything from pointing and laughing to a direct punch to the face is possible. Ok, maybe not a direct punch, but at least a girlish slap or a really mean insult.

Unfortunately, explaining what I love about Java can't be done in a single sentence or two as a quick response to "Your language sucks!". It takes vast amounts of time and will often require me to peel away layers upon layers of prejudice. Since I am by nature a very lazy person I have decided to write this post so I can simply direct people here in the future. I'm sure that the number of Java hating people who will actually read this post will be very very close to zero, but thats an insignificant fact I can easily trick myself into forgetting.

Lets first get the most common type of prejudice I encounter as a Java developer out of the way. No, Java is not the only programming language I know and yes I have developed plenty of larger projects in languages without garbage collection - even in C. In fact, let me take this opertunity to bore you with the long and dull tale of my background as a software developer.

I took my first steps as a software developer in the mid eighties with Logo on my dads Apple II. The fact that I could make that little turtle on the screen do anything I wanted was so incredibly alluring. Yes, I know it was just a little triangle on a low resolution monochrome screen, but I was not even ten at the time and had never seen a real turtle. To me the experience was as close to having my own little reptilian slave, as I could imagine.... see, this is the kind of power trip functional programming will bring forth in young minds!

A couple of years later I moved with my mom and sister to Saudi Arabia. We lived near a hospital where my mom worked. Several of her friends among the local doctors helped out teaching me and my sister english, math and physics. Two of them were very interested in computers and soon introduced me to BASIC on a ZX Spectrum and a lunchbox sized Sharp PC7000. This opened up a whole new world for me. Not only was basic a fun language to learn for an 11 year old boy, but I now had two enthusiastic intelligent programmers who would gladly spend their time sharing their knowledge. This was so addicting that I had to get my own computer programmable in Basic. I ended up getting a small Casio fx-785P calculator. Not only was it programmable in BASIC, but it also had a rudimentary assembler and came with a small booklet introducing me to the concepts of assembly programming.

After we returned to Denmark in the late eighties, I finally got my own full sized computer - an Amstrad CPC 6128 which let me move on to Pascal as my next language. The structure and encapsulation that Pascal brought to my programs allowed me to progress in the size and scope of my projects.

In 1991 I went to Canada for a year as an exchange student. Here I excitedly enrolled in the computer science class at the local high school and was greatly dissapointed when I completed the years worth of material in about two months without really learning anything new. Luckily, my teacher in this class was just as excited to actually have a student with a passion for programming that he gladly spent the rest of the year introducing me to C and C++.

Upon returning to Denmark, I sold my Amstrad and bought an Amiga 500 which would once again bring me back to assembly programming, this time on the far more elegant Motorola 68000 CPU. The early nineties was an amazing time to be a programmer on the Amiga in europe. I was no longer alone as an enthusiastic amateur programmer, but surrounded by other young people competing in demo competitions all over europe.

In 1995 I started studying for my masters degree in computer science at a danish university (I never actually completed my masters.. I think I am still lacking about half a semesters worth of non cs related course work). I never really felt that the actual courses at the university gave me that much, but the vast amounts of information and topics that it introduced me to moved me further as a programmer than anything else had in the previous ten years. I started learning new languages as a hobby and soon developed a serious crush on the simplicity of the Forth core, the object oriented beauty of Smalltalk and the strangely different Prolog. Unfortunately, none of these easily lovable languages were able to keep up with all of my needs. So I stayed faithfull to C++.

By the end of 1995 I discovered Java. It was slow as hell, but strangely appealing. I started out using it purely for applets, which brought back a lot of the excitement of the demo scene on the Amiga. I even created a small applet (yeah, I know, that thing looks ridiculous today, especially since I didn't time sync it) for the Java intro competition at The Party in 1997. I was having more fun programming than I had had in a long time. I was still trying out new languages continouesly, but I never really found anything that was a much fun as Java at the time.

As time passed, so did the speed of Java and the size of its bundled API. If Sun hadn't kept focus on improving the Java platform, I'm sure I would have left it many years ago, but somehow it kept feeling like the development of the Java platform was keeping pace with my needs as a programmer. So it was only natural for me to keep using it as my primary language. Thats basically the story of how I ended up with Java as my primary language, which brings us to the actual point of this post. What it is that still makes me love Java today. There are four primary features that currently keep Java as my primary language.

A modern garbage collector
I have no problems writing well functioning code in languages without garbage collection or with manually assisted garbage collection such as pre-leopard Cocoa. However, with the fast well functioning garbage collector in Java, I feel like I am wasting my time whenever I handle these things manually in other languages.

A full featured consistent API
One of the things that finally made me drop PHP as a web development platform was the consistently inconsistent API. In Java the standard API is consistent enough that I can often guess how it works and how methods will be named.

Good performance
I don't necessarily need to be working in the fastest language/API combo possible, but performance is a factor for most of the software I develop so it needs to be close enough to the top, that I can remain competitive.

A good security model
There seems to be a direct correlation between the size of my programmer-ego and the number of mistakes I assume I won't make. By choosing a language with a good security model I can spend less time worrying about the quality of my code and more time bragging about my coding skills.

Java satisfies these four demands quite well. But, honestly any language/API combo that would satisfy these demands just as well for my intended target platform would work for me. I have no religious attachments to Java, but the fact that it has evolved hand in hand with my needs over the past 11 years and kept my life as a programmer fun has kept it as my primary language.

Monday, January 8, 2007

Syntactic Sirup

It's quite obvious that the world lacks yet another variation of the term syntactic sugar. Let me thus introduce a new one with the naming of this blog.

Syntactic Sirup: An addition to the syntax of a language which at first seems wonderful and sweet, but will soon leave your code in a sticky mess that is more easily rewritten than cleaned up.