Is str ~ regex the root of all evil, or the leaf of all good?
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Wed Feb 18 21:35:20 PST 2009
I'm almost done rewriting the regular expression engine, and some pretty
interesting things have transpired.
First, I separated the engine into two parts, one that is the actual
regular expression engine, and the other that is the state of the match
with some particular input. The previous code combined the two into a
huge class. The engine (written by Walter) translates the regex string
into a bytecode-compiled form. Given that there is a deterministic
correspondence between the regex string and the bytecode, the Regex
engine object is in fact invariant and cached by the implementation.
Caching makes for significant time savings even if e.g. the user
repeatedly creates a regular expression engine in a loop.
In contrast, the match state depends on the input string. I defined it
to implement the range interface, so you can either inspect it directly
or iterate it for all matches (if the "g" option was passed to the engine).
The new codebase works with char, wchar, and dchar and any random-access
range as input (forward ranges to come, and at some point in the future
input ranges as well). In spite of the added flexibility, the code size
has shrunk from 3396 lines to 2912 lines. I plan to add support for
binary data (e.g. ubyte - handling binary file formats can benefit a LOT
from regexes) and also, probably unprecedented, support for arbitrary
types such as integers, floating point numbers, structs, what have you.
any type that supports comparison and ranges is a good candidate for
regular expression matching. I'm not sure how regular expression
matching can be harnessed e.g. over arrays of int, but I suspect some
pretty cool applications are just around the corner. We can introduce
that generalization without adding complexity and there is nothing in
principle opposed to it.
The interface is very simple, mainly consisting of the functions
regex(), match(), and sub(), e.g.
foreach (e; match("abracazoo", regex("a[b-e]", "g")))
writeln(e.pre, e.hit, e.post);
auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
Two other syntactic options are available:
"abracazoo".match(regex("a[b-e]", "g")))
"abracazoo".match("a[b-e]", "g")
I could have made match a member of regex:
regex("a[b-e]", "g")).match("abracazoo")
but most regex code I've seen mentions the string first and the regex
second. So I dropped that idea.
Now, match() is likely to be called very often so I'm considering:
foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
writeln(e);
In general I'm weary of unwitting operator overloading, but I think this
case is more justified than others. Thoughts?
Andrei
More information about the Digitalmars-d
mailing list