Is str ~ regex the root of all evil, or the leaf of all good?

Max Samukha samukha at voliacable.com.removethis
Thu Feb 19 00:57:36 PST 2009


On Wed, 18 Feb 2009 21:35:20 -0800, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> wrote:

>I'm almost done rewriting the regular expression engine, and some pretty 
>interesting things have transpired.
>
>First, I separated the engine into two parts, one that is the actual 
>regular expression engine, and the other that is the state of the match 
>with some particular input. The previous code combined the two into a 
>huge class. The engine (written by Walter) translates the regex string 
>into a bytecode-compiled form. Given that there is a deterministic 
>correspondence between the regex string and the bytecode, the Regex 
>engine object is in fact invariant and cached by the implementation. 
>Caching makes for significant time savings even if e.g. the user 
>repeatedly creates a regular expression engine in a loop.
>
>In contrast, the match state depends on the input string. I defined it 
>to implement the range interface, so you can either inspect it directly 
>or iterate it for all matches (if the "g" option was passed to the engine).
>
>The new codebase works with char, wchar, and dchar and any random-access 
>range as input (forward ranges to come, and at some point in the future 
>input ranges as well). In spite of the added flexibility, the code size 
>has shrunk from 3396 lines to 2912 lines. I plan to add support for 
>binary data (e.g. ubyte - handling binary file formats can benefit a LOT 
>from regexes) and also, probably unprecedented, support for arbitrary 
>types such as integers, floating point numbers, structs, what have you. 
>any type that supports comparison and ranges is a good candidate for 
>regular expression matching. I'm not sure how regular expression 
>matching can be harnessed e.g. over arrays of int, but I suspect some 
>pretty cool applications are just around the corner. We can introduce 
>that generalization without adding complexity and there is nothing in 
>principle opposed to it.
>
>The interface is very simple, mainly consisting of the functions 
>regex(), match(), and sub(), e.g.
>
>foreach (e; match("abracazoo", regex("a[b-e]", "g")))
>     writeln(e.pre, e.hit, e.post);
>auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
>
>Two other syntactic options are available:
>
>"abracazoo".match(regex("a[b-e]", "g")))
>"abracazoo".match("a[b-e]", "g")
>
>I could have made match a member of regex:
>
>regex("a[b-e]", "g")).match("abracazoo")
>
>but most regex code I've seen mentions the string first and the regex 
>second. So I dropped that idea.
>
>Now, match() is likely to be called very often so I'm considering:
>
>foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
>     writeln(e);
>
>In general I'm weary of unwitting operator overloading, but I think this 
>case is more justified than others. Thoughts?
>

Please anything but ~. It would be fine, if it didn't follow an array.
It could be 'in' as Bill suggested. Or maybe /, which reminds of sed
and downs:

foreach (e; "abracazoo"/regex("a[b-e]", "g"))
     writeln(e);

If you made 'match' a member of Regex, another option would be the
opCall:

regex("a[b-e]", "g")( "abracazoo")



More information about the Digitalmars-d mailing list