Thursday, February 22, 2007

Regular Expressions and String Literals

As previously stated on this blog, I really love Java. I am also very much in love with regular expressions. Unfortunately, these two girls just don't get along. Sure, they will say they do, but whenever I try to get them to cooperate the fighting begins.

If you don't like Java, you will probably use this as an opportunity to say "told you so!". If you like Java but haven't worked much with regular expressions in Java, you will probably point me to the big and wonderful API for working with regular expressions and tell me to go read the documentation... surely, this API must be as nice as the rest of the Java APIs. If you both like Java and have worked seriously with regular expressions in Java, you will probably be nodding right now and saying "Ah yes... string literals".

See, the problem with Java and regular expressions isn't the functionality of its regular expression library. It's the fact that the language has no literals for entering regular expressions, which makes working with them a pain instead of a pleasure.

Let me show you an example. String objects in Java have a nice method for doing replacements based on regular expressions, replaceAll. It takes two Strings as parameters: first the regular expression to search for, then the string to replace each match with. The replacement string can include $ references (such as $1) to groups captured by the regular expression. So the following call would replace all a's with o's.

String str="Foo Bar";
str=str.replaceAll("a","o");
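
As a small sketch of the group references mentioned above (the class name, the swapWords helper and the pattern are made up for illustration), $1 and $2 in the replacement refer to the first and second captured groups:

```java
final class GroupReferenceDemo {
    // Hypothetical helper: swaps two words separated by a single space,
    // using $1 and $2 in the replacement to refer to the captured groups.
    static String swapWords(String input) {
        return input.replaceAll("([A-Za-z]+) ([A-Za-z]+)", "$2 $1");
    }

    public static void main(String[] args) {
        String str = "Foo Bar";
        // The call from the post: every a becomes an o.
        System.out.println(str.replaceAll("a", "o")); // prints "Foo Bor"
        // Group references in the replacement string.
        System.out.println(swapWords(str));           // prints "Bar Foo"
    }
}
```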

Simple, right? So what happens when we want to replace a \ with \\ instead of an a with an o? Let's look at the regular expression part. First we need to escape the \ as \\ to satisfy the regular expression engine, which sees \ as a special character. Then we need to do the same for the Java string literal that we are using to enter it, and we must naturally do that for both characters, so we end up with "\\\\" to match a single \. And how about the replacement part? Well, the same thing applies there, because \ is also a special character in the replacement string, except that we end up with twice the number of characters, giving us the following two parameters for the replaceAll method.

str=str.replaceAll("\\\\","\\\\\\\\\");

Take a close look at that. Can you actually tell that there are exactly eight characters in the second parameter without using any assistance such as a finger or your mouse cursor to count them? If so, you need to reevaluate your own counting abilities, because there were actually nine characters - the last one was just thrown in to show how hard it is to count that many identical characters quickly. Let's take a look at the correct version for a second.

str=str.replaceAll("\\\\","\\\\\\\\");

It still looks silly, doesn't it? Remember that this is just a very simple expression trying to replace a single character with two of the same kind. For this very simple example we could actually have grouped the match and referenced it as follows.

str=str.replaceAll("(\\\\)","$1$1");

However, once you start doing any serious regular expression work that includes characters that need escaping for the regular expression, the string literal, or both, things get ugly very quickly. The sad thing is that it really doesn't have to be this hard. There is really no reason why we couldn't have regular expression literals in Java... perhaps even with a customizable escape character if we were really lucky.
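
To make the layers of escaping in the backslash example concrete, here is a rough sketch (the class and method names are invented for this illustration). It runs the hand-escaped call and the grouped variant from above, and also tries Pattern.quote and Matcher.quoteReplacement from java.util.regex, which let the library handle the regular expression layer so that only the string literal escaping remains:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BackslashDoublingDemo {
    // Hand-escaped: the first literal is two backslashes at run time,
    // which the regex engine reads as one literal backslash. The second
    // is four backslashes at run time, which the replacement syntax
    // reads as two literal backslashes.
    static String byHand(String s) {
        return s.replaceAll("\\\\", "\\\\\\\\");
    }

    // Grouped variant: capture the backslash and emit the group twice.
    static String byGroup(String s) {
        return s.replaceAll("(\\\\)", "$1$1");
    }

    // Library-escaped: only Java string-literal escaping is left.
    static String byQuoting(String s) {
        return s.replaceAll(Pattern.quote("\\"), Matcher.quoteReplacement("\\\\"));
    }

    public static void main(String[] args) {
        // Sample input, chosen for illustration: each backslash is
        // written twice in the literal but is single at run time.
        String path = "C:\\temp\\file.txt";
        System.out.println(byHand(path));
        System.out.println(byGroup(path));
        System.out.println(byQuoting(path));
    }
}
```

All three methods should double every backslash in the input. The quoting helpers do not remove the string literal escaping, so they only soften the problem rather than solve it; for a purely literal replacement like this one, String.replace("\\", "\\\\") would avoid the regular expression layer altogether.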

4 comments:

Anonymous said...

This is exactly why Python has raw strings (r-strings), which are prefixed with an 'r'. From the docs:

When an "r" or "R" prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string. For example, the string literal r"\n" consists of two characters: a backslash and a lowercase "n". String quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote;

Kasper Jeppesen said...

Nice, although it doesn't actually seem like much more than just another name for the here document literal present in many other languages. For a true raw string literal, have a look at Fortran's Hollerith notation. ;-)

However, I must say that I find Python's raw string syntax very attractively compact.

I also like how several other languages such as Ruby and Oracle's PL/SQL allow you to enter string literals where you choose the quoting character to maintain good readability.

Unknown said...

> This is exactly why Python has raw strings (r-strings), which are prefixed with an 'r'.

Indeed. This is also why Perl, JavaScript and Ruby have a regular-expression literal (/foo/), which also allows for easy modifiers (want to have a case-insensitive global multiline regex? /foo/igm).

And why even C# has raw strings much like Python's: if you prefix a string literal with "@" (e.g. @"foo") then it becomes a raw string and "\" isn't a string escape anymore. One of the great things is that the "@" raw string modifier also counts as a multiline string modifier: a regular C# string is single-line (Java-like), but an @-prefixed string can span several lines, which allows you to write much cleaner multiline commented REs (compiled with RegexOptions.IgnorePatternWhitespace), e.g.

string pat = @"
( # start the first group
abra # match the literal 'abra'
( # start the second (inner) group
cad # match the literal 'cad'
)? # end the second (optional) group
) # end the first group
+ # match one or more occurrences
";

jrockway said...

> str=str.replaceAll("\\\\","\\\\\\\\\");

Heh, Emacs Lisp is even better. Capturing parens need to be backslashed for the regex engine, but backslashes need to be backslashed for quoted strings. Not fun.