Is str ~ regex the root of all evil, or the leaf of all good?

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Wed Feb 18 21:35:20 PST 2009


I'm almost done rewriting the regular expression engine, and some pretty 
interesting things have transpired.

First, I separated the engine into two parts, one that is the actual 
regular expression engine, and the other that is the state of the match 
with some particular input. The previous code combined the two into a 
huge class. The engine (written by Walter) translates the regex string 
into a bytecode-compiled form. Given that there is a deterministic 
correspondence between the regex string and the bytecode, the Regex 
engine object is in fact invariant and cached by the implementation. 
Caching makes for significant time savings even if e.g. the user 
repeatedly creates a regular expression engine in a loop.

In contrast, the match state depends on the input string. I defined it 
to implement the range interface, so you can either inspect it directly 
or iterate it for all matches (if the "g" option was passed to the engine).

The new codebase works with char, wchar, and dchar and any random-access 
range as input (forward ranges to come, and at some point in the future 
input ranges as well). In spite of the added flexibility, the code size 
has shrunk from 3396 lines to 2912 lines. I plan to add support for 
binary data (e.g. ubyte - handling binary file formats can benefit a LOT 
from regexes) and also, probably unprecedented, support for arbitrary 
types such as integers, floating point numbers, structs, what have you. 
any type that supports comparison and ranges is a good candidate for 
regular expression matching. I'm not sure how regular expression 
matching can be harnessed e.g. over arrays of int, but I suspect some 
pretty cool applications are just around the corner. We can introduce 
that generalization without adding complexity and there is nothing in 
principle opposed to it.

The interface is very simple, mainly consisting of the functions 
regex(), match(), and sub(), e.g.

foreach (e; match("abracazoo", regex("a[b-e]", "g")))
     writeln(e.pre, e.hit, e.post);
auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");

Two other syntactic options are available:

"abracazoo".match(regex("a[b-e]", "g")))
"abracazoo".match("a[b-e]", "g")

I could have made match a member of regex:

regex("a[b-e]", "g")).match("abracazoo")

but most regex code I've seen mentions the string first and the regex 
second. So I dropped that idea.

Now, match() is likely to be called very often so I'm considering:

foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
     writeln(e);

In general I'm weary of unwitting operator overloading, but I think this 
case is more justified than others. Thoughts?


Andrei



More information about the Digitalmars-d mailing list