Sunday, January 21, 2007

TupleSoup, Part 3

The clear lesson from my last post on TupleSoup was that the index is the primary performance bottleneck in my current implementation. This was especially evident in the update operation that needs to access the index twice if the updated row won't fit in the same space as the original. The next step for improving TupleSoup performance will thus be the implementation of an index cache. The goal is to get close to the same high performance we got from the memory based index for often used rows.

The actual caching of index entries is very easy to implement. Simply create a hashtable and use it to cache your entries using the row id as the key. The slightly trickier part is how to maintain the index so it doesn't grow above a fixed maximum size. The optimal implementation (some times referred to as a clairvoyant replacement algorithm) would simply keep the things in the index that we will need access to next. Unfortunately, to even come close to accurately predicting this, we would need a very thorough understanding of the system that is going to be using TupleSoup. That doesn't really fit well with the goal of TupleSoup being a generic data engine, so instead we must look for a generic algorithm which maintains the cache based on usage.

A good source for theory on cache algorithms is theory for page replacement algorithms. These are algorithms that are used by operating systems to choose which memory pages to keep in memory and which to swap out when new pages are read from disk. Some of these algorithms uses knowledge about how applications use memory. This is naturally not of use to us, but the ideas and concepts are very useful for designing our own caching algorithm. Unfortunately, I am going to disappoint you now, because all I will implement to begin with is a very simple least recently used algorithm. I do have some fun ideas for more advanced algorithms, but they are closely linked to value based indexes and query caching. Since we currently have neither a value based indexing mechanism or anything even resembling a query system, this will have to wait.

But for now, lets look at the least recently used algorithm. The idea is very simple. We use a fixed sized list of index entries. Whenever we use an entry, we remove it from its current position in the list and move it to the front. When we need space to add a new entry, we remove the entry currently in the back and add the new one in the front. The idea is that entries which are used often, will be "saved" and put to the front again, while entries rarely used will drop to the back and be removed.

If the cache is big enough to keep recurring entries in the list until they are used again, this can be very effective. Unfortunately, the larger the cache, the bigger the overhead will be from maintaining the cache. Lets have a quick look at the same benchmark as we used when we looked at indexing algorithms, but this time with a least recently used cache algorithm in use. We will use the big 50.000 rows dataset with a paged index and a cache size of 10.000 index entries. This wont give us any real benchmark on how the cache algorithm will perform for actual usage, but it will reveal the overhead involved when adding new entries and a slight hint of the effect in the fetch and update tests.

Test typeOriginalWith cache
Add0.582 ms0.621 ms
Fetch18.430 ms16.989 ms
Update47.355 ms23.382 ms

The add is slightly slower, but even in this very unnatural benchmark we actually end up getting a performance improvement in both fetch and update. Lets increase the size of the cache to 50.000 entries so it can hold the full dataset. This should give us an idea of the best performance we can expect from it if the usage of the dataset falls perfectly within the bounds of the least recently used algorithm.

Test typeOriginalFull cache
Add0.582 ms0.588 ms
Fetch18.430 ms1.197 ms
Update47.355 ms14.557 ms

We are still far away from the performance of the memory based index, even in the fetch test due to the overhead of the caching algorithm, but we are definitely approaching a good usable performance.

2 comments:

Andrew Hitchcock said...

Hi. This blog is getting interesting to follow. I'm adding you to my RSS and looking forward to the next update :)

Kasper Jeppesen said...

Thanks... its still pretty basic stuff, but it should pick up the pace soon ;-)