Is str ~ regex the root of all evil, or the leaf of all good?

Thu Feb 19 06:49:38 PST 2009

Don wrote:
> Andrei Alexandrescu wrote:
>> I'm almost done rewriting the regular expression engine, and some 
>> pretty interesting things have transpired.
>>
>> First, I separated the engine into two parts, one that is the actual 
>> regular expression engine, and the other that is the state of the 
>> match with some particular input. The previous code combined the two 
>> into a huge class. The engine (written by Walter) translates the regex 
>> string into a bytecode-compiled form. Given that there is a 
>> deterministic correspondence between the regex string and the 
>> bytecode, the Regex engine object is in fact invariant and cached by 
>> the implementation. Caching makes for significant time savings even if 
>> e.g. the user repeatedly creates a regular expression engine in a loop.
>>
>> In contrast, the match state depends on the input string. I defined it 
>> to implement the range interface, so you can either inspect it 
>> directly or iterate it for all matches (if the "g" option was passed 
>> to the engine).
>>
>> The new codebase works with char, wchar, and dchar and any 
>> random-access range as input (forward ranges to come, and at some 
>> point in the future input ranges as well). In spite of the added 
>> flexibility, the code size has shrunk from 3396 lines to 2912 lines. I 
>> plan to add support for binary data (e.g. ubyte - handling binary file 
>> formats can benefit a LOT from regexes) and also, probably 
>> unprecedented, support for arbitrary types such as integers, floating 
>> point numbers, structs, what have you. any type that supports 
>> comparison and ranges is a good candidate for regular expression 
>> matching. I'm not sure how regular expression matching can be 
>> harnessed e.g. over arrays of int, but I suspect some pretty cool 
>> applications are just around the corner. We can introduce that 
>> generalization without adding complexity and there is nothing in 
>> principle opposed to it.
>>
>> The interface is very simple, mainly consisting of the functions 
>> regex(), match(), and sub(), e.g.
>>
>> foreach (e; match("abracazoo", regex("a[b-e]", "g")))
>>     writeln(e.pre, e.hit, e.post);
>> auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
>>
>> Two other syntactic options are available:
>>
>> "abracazoo".match(regex("a[b-e]", "g")))
>> "abracazoo".match("a[b-e]", "g")
>>
>> I could have made match a member of regex:
>>
>> regex("a[b-e]", "g")).match("abracazoo")
>>
>> but most regex code I've seen mentions the string first and the regex 
>> second. So I dropped that idea.
>>
>> Now, match() is likely to be called very often so I'm considering:
>>
>> foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
>>     writeln(e);
>>
>> In general I'm weary of unwitting operator overloading, but I think 
>> this case is more justified than others. Thoughts?
>>
>>
>> Andrei
> 
> I agree with the comments against ~.
> I believe this Perl6 document is a must-read:
> 
> http://dev.perl.org/perl6/doc/design/apo/A05.html
> 
> There are some excellent observations there, especially near the 
> beginning. By separating the engine from the state of the match, you 
> open the possibilty of subsequently providing cleaner regex syntax.

I'd read it a while ago, but a refresher is in order. Thanks!

> I do wonder though, how you'd deal with a regex which includes a match 
> to a literal string provided as a variable. Would this be passed to the 
> engine, or to the match state?

At the moment these are not supported. It's a good question.

> If the engine is using backtracking, there's no difference in the 
> generated bytecode; but if it's creating an automata, the compiled 
> engine depends on the contents of the string variable.

The current engine is, to the best of my understanding, using 
backtracking. At least when there's an "or", it tries both matches as 
recursive calls and picks the longest.

Andrei