Monday, February 26, 2007

Minimizing Id Size While Maximizing Readability

The last few posts have primarily been me bitching about things that annoy me, so I thought I would throw in a little happy-go-lucky constructive post with almost no bitching at all.

I mentioned in my last post about pretty urls that I always attempt to create my ids so they avoid using characters that are easily confused, such as I, 1, l, O and 0. Doing so is actually very easy and in this little post I will show you how I do it in Ravenous and even provide you with a ready to use implementation in Java.

The basic idea is that we convert integer ids into a string representation using only the letters and numbers that are not easily confused. For my implementation, I chose to use the following 31 characters: 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, G, H, J, K, M, N, P, Q, R, S, T, U, V, W, X, Y, Z.

A positive side benefit of this is that we will also be moving from base 10 to base 31, which in many cases will seriously reduce the number of characters used for an id.

So let's start talking about the implementation by looking at the way we would convert a number into a string. First, let's build an array of chars containing our 31-character set.

char[] digits={'2', '3', '4', '5', '6',
               '7', '8', '9', 'A', 'B',
               'C', 'D', 'E', 'F', 'G',
               'H', 'J', 'K', 'M', 'N',
               'P', 'Q', 'R', 'S', 'T',
               'U', 'V', 'W', 'X', 'Y',
               'Z'};

We use this array to look up directly the character that a number between 0 and 30 should be converted into. To convert a full number into a string, we simply keep dividing it by 31, each time adding the character representing the remainder of the division to the string. When the remaining number is 0 we are done.

String convertToString(long num){
    StringBuilder buf=new StringBuilder();
    while(num>0){
        // the remainder is always 0..30, so the cast to int is safe
        buf.append(digits[(int)(num % 31)]);
        num=num/31;
    }
    return buf.toString();
}
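To see the division loop at work, here is a minimal self-contained sketch that runs it on one sample id. The class name Base31EncodeDemo and the constant DIGITS are made up for this illustration; the character set is the same 31 characters as in the array above.

```java
class Base31EncodeDemo {
    // The 31 characters from the digits array: no I, L, O, 0 or 1.
    static final char[] DIGITS = "23456789ABCDEFGHJKMNPQRSTUVWXYZ".toCharArray();

    // Repeatedly divide by 31, appending the character for each remainder.
    // The least significant "digit" therefore comes first in the result.
    static String convertToString(long num) {
        StringBuilder buf = new StringBuilder();
        while (num > 0) {
            buf.append(DIGITS[(int) (num % 31)]);
            num /= 31;
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // 1234 % 31 = 25 -> 'U', then 39 % 31 = 8 -> 'A', then 1 % 31 = 1 -> '3'
        System.out.println(convertToString(1234)); // prints UA3
    }
}
```

Note that with this loop the id 0 comes out as an empty string, since there is nothing left to divide.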

Easy right? Converting a string back into a number is just as easy. This time we simply run through each character, convert it back into a number based on its position in our translation array and multiply that by 31 raised to the power of the current character position. Sound confusing? Try imagining that we were working with a base 10 system. The value of the first character would be multiplied by 10 ^ 0 = 1, the next by 10 ^ 1 = 10, the next by 10 ^ 2 = 100 and so forth, thus multiplying each character by the value of its position. Adding up these products gives the number. This is the exact same thing we are doing in our base 31 system. So let's look at the code. I'm assuming that we have a method called charToInt that will convert a base 31 character into its int value from our table.

long convertToLong(String str){
    long num=0;
    long factor=1; // 31 raised to the power of i
    for(int i=0;i<str.length();i++){
        char c=str.charAt(i);
        num+=charToInt(c)*factor;
        factor*=31;
    }
    return num;
}

Easy right? So how about that charToInt method. Well, there are several strategies you can use for implementing it, but honestly the simplest is just to create a switch statement that utilizes the numeric value of the char to switch into a statement that returns the right value.

int charToInt(char c){
    switch(c){
        case '2':return 0;
        case '3':return 1;
        // ... the remaining 29 cases go here
    }
    return 0;
}

I have omitted the last 29 cases, but you should be getting the point by now :-). The last return is purely there to satisfy the Java compiler and to provide a semi-rational behavior when (not if) someone attempts to feed it an invalid id string. If you want to optimize this further, you could create an array just like the one converting from numbers to characters. All you need to do is find the largest numeric value of those 31 characters, create an array large enough to be indexed by that value and fill in the right values.
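The array-based lookup could be sketched roughly as follows. The names Base31Lookup, DIGITS and VALUES are invented for this example. Unlike the switch version, which falls back to 0, this sketch throws an exception on characters outside the set; that is an assumption made for the illustration, not part of the approach described above.

```java
import java.util.Arrays;

class Base31Lookup {
    static final char[] DIGITS = "23456789ABCDEFGHJKMNPQRSTUVWXYZ".toCharArray();

    // Reverse table indexed by character code; -1 marks characters outside the set.
    // 'Z' has the largest code of the 31 characters, so 'Z' + 1 slots are enough.
    static final int[] VALUES = new int['Z' + 1];
    static {
        Arrays.fill(VALUES, -1);
        for (int i = 0; i < DIGITS.length; i++) {
            VALUES[DIGITS[i]] = i;
        }
    }

    static int charToInt(char c) {
        if (c >= VALUES.length || VALUES[c] < 0) {
            throw new IllegalArgumentException("Not a base 31 character: " + c);
        }
        return VALUES[c];
    }

    static long convertToLong(String str) {
        long num = 0;
        long factor = 1; // 31^i for the character at position i
        for (int i = 0; i < str.length(); i++) {
            num += charToInt(str.charAt(i)) * factor;
            factor *= 31;
        }
        return num;
    }

    public static void main(String[] args) {
        // 'U' = 25, 'A' = 8, '3' = 1, so 25 + 8 * 31 + 1 * 961 = 1234
        System.out.println(convertToLong("UA3")); // prints 1234
    }
}
```

The table costs 91 ints of memory and turns each lookup into a single array access.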

So what do these ids look like and how much space will we save? Let's end this article with a set of sample ids, the strings they would be converted into and a rounded down percentage for how much space we are saving.

Id          String   Saved
1           3         0 %
12          E        50 %
123         Z5       33 %
1234        UA3      25 %
12345       9VE      40 %
123456      GG66     33 %
1234567     SPFC3    28 %
12345678    QQEDF    37 %
123456789   425QB6   33 %
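For anyone who wants to check the sample rows or try other ids, the figures can be regenerated with a small sketch like this one. The class name Base31Savings and the helper savedPercent are hypothetical; the saving is taken as the difference in length between the decimal form and the base 31 form, rounded down.

```java
class Base31Savings {
    static final char[] DIGITS = "23456789ABCDEFGHJKMNPQRSTUVWXYZ".toCharArray();

    static String convertToString(long num) {
        StringBuilder buf = new StringBuilder();
        while (num > 0) {
            buf.append(DIGITS[(int) (num % 31)]);
            num /= 31;
        }
        return buf.toString();
    }

    // Percentage of characters saved compared to the decimal form.
    // Integer division rounds down for the non-negative values used here.
    static int savedPercent(long id) {
        int decimalLength = Long.toString(id).length();
        int encodedLength = convertToString(id).length();
        return (decimalLength - encodedLength) * 100 / decimalLength;
    }

    public static void main(String[] args) {
        long[] ids = {1, 12, 123, 1234, 12345, 123456, 1234567, 12345678, 123456789};
        for (long id : ids) {
            System.out.println(id + " " + convertToString(id) + " " + savedPercent(id) + " %");
        }
    }
}
```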

Friday, February 23, 2007

The Rise of the URL Fascist Regime

Some time within the past couple of years, something really strange happened. Most of my web developing friends turned into a bunch of very vocal url fascists.

Urls are suddenly divided into two sharply separated groups: the urls that fit their views and the urls that do not. If your urls do not fit into the group supported by the fascist leaders, it doesn't matter what the reasons are for your url design or how great your system is. Your urls suck, thus your system sucks and you as a developer are at best hopelessly misguided.

So what is this all about?

The people I'm referring to are not graphic designers who do a bit of web development. The leaders of this regime are all programmers who feel right at home in emacs and vi. The natural assumption would thus be that this is one of the regular crusades for following standards, but that's not it at all... this is not about making urls more technically functional. This is about making the urls look pretty, that's actually what the whole topic is called - pretty urls. Not readable urls, not memorizable urls, not semantic urls, not understandable urls, not hackable urls, not short urls.... but pretty urls.

I decided to ask around a bit, to see if I could better understand what exactly divides the ugly from the pretty. It turns out that there are varying opinions about what the ultimate pretty url looks like, but everybody seems to agree on what an ugly url looks like. One of my friends gave me the following example.

Ok, even I can follow that this is not pretty. I would also say that it's not readable, memorizable or understandable either. However, I do think it has some good functional sides which I will return to later.

So, I tried to create some different examples of urls which I passed around to get a prettiness rating and I finally seemed to figure out the primary difference between a pretty and an ugly URL. Take a look at the following two URLs.

The ruling from the fascist regime is that the first url is ugly and the second one is pretty. My first instinct was to conclude that this was a question of keeping the urls as short as possible, but it turns out that that's not it at all. The following url combines the two by adding the name of the parameter as a part of the url path.

Almost everybody agreed that this was the prettiest of them all. Since this is also the longest of them all, that rules out the short-is-better theory. As can easily be deduced from the fact that the fascist leaders didn't mind the actual word id being part of the path, it's not a question of exposing the names of the variables to the users. It's all a question of keeping the ? = and & characters out of the URL. Apparently these are really ugly characters.

Here are a few more examples of pretty urls given to me by people from the new regime. Notice how they are not designed to be short. Instead they do everything possible to keep all parameters in the path.

Notice another thing here. None of these refer to data using an id. They use nearly readable names instead. It should be mentioned though that that's not a fixed rule within the pretty url regime, some people still use ids.

So, to sum it up: Pretty urls must do everything possible to avoid the regular query parameters and if it makes sense, they should be humanly readable.

Unfortunately, none of the believers in the new pretty url regime can actually explain what is so evil about query parameters. Most people I asked could not come up with anything better than "it's ugly". In fact, most of these people can't even give a good precise definition of a pretty url, and even less an explanation of why it's good. The following three statements were the only direct statements I could get out of any of them (it should be noted that two of these statements were given to me in Danish, so these are my translations).

  • Each element in the path should be a further refinement into the dataset.
  • The overall path should be humanly readable and understandable.
  • Query parameters should not be used for navigation.

It's not that I completely disagree with these ideas, but it seems that everybody has gotten so focused on the new and wonderful world made possible by brand new technology such as mod_rewrite (been there since 1996) that everything must now be done in this way or it's just way too web1.0 to be acceptable.

So why am I making a big fuss about this and calling half of my friends fascists? First of all, I'm calling them fascists because they consider the system to be more important than the individuals. That is, pretty urls are more important to them than the actual websites. Secondly, I'm making a big fuss because query parameters actually do have some good things going for them that shouldn't be left out just because you think a question mark or an equal sign is ugly. As long as you use proper parameter names, they let the user know what the individual "magic" parameters actually are.

I'm not saying that you shouldn't put anything in the path. I am definitely all for the move from using the path to specify code into using the path to specify data. But religiously staying away from query parameters because they are ugly is just plain stupid. One of the uses for query parameters that I really like is view manipulation, for example holding the offset and sort order of a table of data. Let's look at an example.

You could put these parameters into the path. But by keeping them as query parameters it is clearly visible to the user what they are and how they change as she works with the table. For more advanced users, this will also make it easier to manipulate the url directly.

So let me end this looong rant with the guidelines I use when designing urls for web applications.

  1. The path should be readable and specify data, not code.
  2. Things that modify the view of the data should be placed in query parameters.
  3. Avoid using characters in ids that are easily confused, such as 0, O, I, 1, l.
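As a rough illustration of the first two guidelines, a url for a sorted, paged table might be assembled along these lines. Everything here is hypothetical: the host example.com, the dataset name customers and the parameter names offset and sort are made up for the sketch, and the Charset overload of URLEncoder.encode requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class TableUrl {
    // Hypothetical layout: the path names the data,
    // the query string holds the state of the view.
    static String tableUrl(String base, String dataset, int offset, String sort) {
        return base + "/" + dataset
                + "?offset=" + offset
                + "&sort=" + URLEncoder.encode(sort, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(tableUrl("http://example.com", "customers", 20, "name"));
        // prints http://example.com/customers?offset=20&sort=name
    }
}
```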

Thursday, February 22, 2007

Regular Expressions and String Literals

As previously stated on this blog, I really love Java. I am also very much in love with regular expressions. Unfortunately, these two girls just don't get along. Sure, they will say they do, but whenever I try to get them to cooperate the fighting begins.

If you don't like Java, you will probably use this as an opportunity to say "told you so!". If you like Java but haven't worked much with regular expressions in Java, you will probably point me to the big and wonderful API for working with regular expressions and tell me to go read the documentation... surely, this API must be as nice as the rest of the Java APIs. If you both like Java and have worked seriously with regular expressions in Java, you will probably be nodding right now and saying "Ah yes... string literals".

See, the problem with Java and regular expressions isn't the functionality of its regular expression library. It's the fact that you don't have literals in the language to enter regular expressions that makes it a pain instead of a pleasure.

Let me show you an example. String objects in Java have a nice method to do replacements based on regular expressions. It takes two Strings as parameters. First the regular expression to search for, then another string to replace with. The replacement string can include $ characters to reference groups caught by the regular expression. So the following call would replace all a's with o's.

String str="Foo Bar";
str=str.replaceAll("a","o"); // str is now "Foo Bor"

Simple right? So what happens when we want to replace a \ with \\ instead of an a with an o? Let's look at the regular expression part. First we need to pad the \ into a \\ to satisfy the regular expression, as it sees the \ as a special character. Then we need to do the same for the Java string literal that we are using to enter it, and we must naturally do that for both characters, so we end up with "\\\\" to match a single \.... and how about the replacement part? Well, the same thing applies there, except that we end up with twice the number of characters, giving us the following two parameters for the replaceAll method.


Take a close look at that. Can you actually tell that there are exactly eight characters without using any assistance such as a finger or your mouse cursor to count them? If so, you need to reevaluate your own counting abilities, because there were actually nine characters - the last one was just thrown in to show how hard it actually is to count that many identical characters quickly. Let's take a look at the correct version for a second.


It still looks silly doesn't it? Remember that this is just a very simple expression trying to replace a single character with two of the same kind. For this very simple example we could actually have grouped the match and referenced it as follows.


However, once you start doing any serious regular expression work that includes characters that need escaping for either the regular expression, the string literal or both, things get ugly very quickly. The sad thing is that it really doesn't have to be this hard, there is really no reason why we couldn't have regular expression literals in Java... perhaps even with a customizable escape character if we were really lucky.
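Until such literals exist, one way to take a layer of escaping off your hands is to let the library quote the pattern and the replacement. The sketch below compares the hand-escaped backslash doubling with a version using Pattern.quote and Matcher.quoteReplacement from java.util.regex. The class and method names are invented for the example, and it only covers this literal case, not general regular expression work.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackslashDoubling {
    // Hand-escaped: the regex \\ matches one backslash and the replacement
    // \\\\ produces two; each is doubled once more for the string literal.
    static String doubleManually(String s) {
        return s.replaceAll("\\\\", "\\\\\\\\");
    }

    // The library handles the regex-level escaping, so only the
    // string literal escaping is left to write by hand.
    static String doubleQuoted(String s) {
        return s.replaceAll(Pattern.quote("\\"), Matcher.quoteReplacement("\\\\"));
    }

    public static void main(String[] args) {
        String path = "C:\\temp"; // the actual text contains a single backslash
        System.out.println(doubleManually(path)); // prints C:\\temp
        System.out.println(doubleQuoted(path));   // prints C:\\temp
    }
}
```

For a purely literal replacement like this one, String.replace(CharSequence, CharSequence) sidesteps the regular expression layer entirely, but that does not help once the pattern really is a regular expression.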