Is str ~ regex the root of all evil, or the leaf of all good?

Thu Feb 19 04:14:00 PST 2009

Andrei Alexandrescu wrote:
> I'm almost done rewriting the regular expression engine, and some pretty 
> interesting things have transpired.
> 
> First, I separated the engine into two parts, one that is the actual 
> regular expression engine, and the other that is the state of the match 
> with some particular input. The previous code combined the two into a 
> huge class. The engine (written by Walter) translates the regex string 
> into a bytecode-compiled form. Given that there is a deterministic 
> correspondence between the regex string and the bytecode, the Regex 
> engine object is in fact invariant and cached by the implementation. 
> Caching makes for significant time savings even if e.g. the user 
> repeatedly creates a regular expression engine in a loop.
> 
> In contrast, the match state depends on the input string. I defined it 
> to implement the range interface, so you can either inspect it directly 
> or iterate it for all matches (if the "g" option was passed to the engine).
> 
> The new codebase works with char, wchar, and dchar and any random-access 
> range as input (forward ranges to come, and at some point in the future 
> input ranges as well). In spite of the added flexibility, the code size 
> has shrunk from 3396 lines to 2912 lines. I plan to add support for 
> binary data (e.g. ubyte - handling binary file formats can benefit a LOT 
> from regexes) and also, probably unprecedented, support for arbitrary 
> types such as integers, floating point numbers, structs, what have you. 
> any type that supports comparison and ranges is a good candidate for 
> regular expression matching. I'm not sure how regular expression 
> matching can be harnessed e.g. over arrays of int, but I suspect some 
> pretty cool applications are just around the corner. We can introduce 
> that generalization without adding complexity and there is nothing in 
> principle opposed to it.
> 
> The interface is very simple, mainly consisting of the functions 
> regex(), match(), and sub(), e.g.
> 
> foreach (e; match("abracazoo", regex("a[b-e]", "g")))
>     writeln(e.pre, e.hit, e.post);
> auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
> 
> Two other syntactic options are available:
> 
> "abracazoo".match(regex("a[b-e]", "g")))
> "abracazoo".match("a[b-e]", "g")
> 
> I could have made match a member of regex:
> 
> regex("a[b-e]", "g")).match("abracazoo")
> 
> but most regex code I've seen mentions the string first and the regex 
> second. So I dropped that idea.
> 
> Now, match() is likely to be called very often so I'm considering:
> 
> foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
>     writeln(e);
> 
> In general I'm weary of unwitting operator overloading, but I think this 
> case is more justified than others. Thoughts?
> 
> 
> Andrei

I agree with the comments against ~.
I believe this Perl6 document is a must-read:

http://dev.perl.org/perl6/doc/design/apo/A05.html

There are some excellent observations there, especially near the 
beginning. By separating the engine from the state of the match, you 
open the possibilty of subsequently providing cleaner regex syntax.

I do wonder though, how you'd deal with a regex which includes a match 
to a literal string provided as a variable. Would this be passed to the 
engine, or to the match state?
If the engine is using backtracking, there's no difference in the 
generated bytecode; but if it's creating an automata, the compiled 
engine depends on the contents of the string variable.