Is str ~ regex the root of all evil, or the leaf of all good?

Denis Koroskin 2korden at gmail.com
Thu Feb 19 00:31:42 PST 2009


On Thu, 19 Feb 2009 08:35:20 +0300, Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org> wrote:

> I'm almost done rewriting the regular expression engine, and some pretty  
> interesting things have transpired.
>
> First, I separated the engine into two parts, one that is the actual  
> regular expression engine, and the other that is the state of the match  
> with some particular input. The previous code combined the two into a  
> huge class. The engine (written by Walter) translates the regex string  
> into a bytecode-compiled form. Given that there is a deterministic  
> correspondence between the regex string and the bytecode, the Regex  
> engine object is in fact invariant and cached by the implementation.  
> Caching makes for significant time savings even if e.g. the user  
> repeatedly creates a regular expression engine in a loop.
>
> In contrast, the match state depends on the input string. I defined it  
> to implement the range interface, so you can either inspect it directly  
> or iterate it for all matches (if the "g" option was passed to the  
> engine).
>
> The new codebase works with char, wchar, and dchar, and with any random-access  
> range as input (forward ranges to come, and at some point in the future  
> input ranges as well). In spite of the added flexibility, the code size  
> has shrunk from 3396 lines to 2912 lines. I plan to add support for  
> binary data (e.g. ubyte - handling binary file formats can benefit a LOT  
> from regexes) and also, probably unprecedented, support for arbitrary  
> types such as integers, floating point numbers, structs, what have you.  
> Any type that supports comparison and ranges is a good candidate for  
> regular expression matching. I'm not sure how regular expression  
> matching can be harnessed e.g. over arrays of int, but I suspect some  
> pretty cool applications are just around the corner. We can introduce  
> that generalization without adding complexity and there is nothing in  
> principle opposed to it.
>
> The interface is very simple, mainly consisting of the functions  
> regex(), match(), and sub(), e.g.
>
> foreach (e; match("abracazoo", regex("a[b-e]", "g")))
>      writeln(e.pre, e.hit, e.post);
> auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
>
> Two other syntactic options are available:
>
> "abracazoo".match(regex("a[b-e]", "g")))
> "abracazoo".match("a[b-e]", "g")
>
> I could have made match a member of regex:
>
> regex("a[b-e]", "g")).match("abracazoo")
>
> but most regex code I've seen mentions the string first and the regex  
> second. So I dropped that idea.
>
> Now, match() is likely to be called very often so I'm considering:
>
> foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
>      writeln(e);
>
> In general I'm wary of unwitting operator overloading, but I think this  
> case is more justified than others. Thoughts?
>
>
> Andrei

"abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't break existing conventions. I prefer it over '~' version. 'in' is also fine (both ways).


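The caching behaviour described in the original post can be illustrated the same way. The Engine struct and regexCached() helper below are hypothetical stand-ins (the real bytecode compiler is elided); the sketch only shows why constructing the same regex repeatedly in a loop can stay cheap once compiled engines are memoized:

import std.stdio;

// Hypothetical stand-in for the compiled engine; the bytecode itself
// is omitted.
struct Engine
{
    string pattern;
    string flags;
}

// Cache compiled engines by pattern and flags so that building the
// "same" regex repeatedly reuses the already-built engine.
Engine* regexCached(string pattern, string flags = "")
{
    static Engine*[string] cache;            // thread-local cache
    auto key = pattern ~ '\0' ~ flags;
    if (auto hit = key in cache)
        return *hit;
    auto eng = new Engine(pattern, flags);   // stand-in for compiling to bytecode
    cache[key] = eng;
    return eng;
}

void main()
{
    foreach (i; 0 .. 1_000)
    {
        // Only the first iteration pays the construction cost.
        auto re = regexCached("a[b-e]", "g");
    }
    writeln("done");
}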

