Is str ~ regex the root of all evil, or the leaf of all good?
Max Samukha
samukha at voliacable.com.removethis
Thu Feb 19 00:57:36 PST 2009
On Wed, 18 Feb 2009 21:35:20 -0800, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> wrote:
>I'm almost done rewriting the regular expression engine, and some pretty
>interesting things have transpired.
>
>First, I separated the engine into two parts, one that is the actual
>regular expression engine, and the other that is the state of the match
>with some particular input. The previous code combined the two into a
>huge class. The engine (written by Walter) translates the regex string
>into a bytecode-compiled form. Given that there is a deterministic
>correspondence between the regex string and the bytecode, the Regex
>engine object is in fact invariant and cached by the implementation.
>Caching makes for significant time savings even if e.g. the user
>repeatedly creates a regular expression engine in a loop.
>
>In contrast, the match state depends on the input string. I defined it
>to implement the range interface, so you can either inspect it directly
>or iterate it for all matches (if the "g" option was passed to the engine).
>
>The new codebase works with char, wchar, and dchar and any random-access
>range as input (forward ranges to come, and at some point in the future
>input ranges as well). In spite of the added flexibility, the code size
>has shrunk from 3396 lines to 2912 lines. I plan to add support for
>binary data (e.g. ubyte - handling binary file formats can benefit a LOT
>from regexes) and also, probably unprecedented, support for arbitrary
>types such as integers, floating point numbers, structs, what have you.
>any type that supports comparison and ranges is a good candidate for
>regular expression matching. I'm not sure how regular expression
>matching can be harnessed e.g. over arrays of int, but I suspect some
>pretty cool applications are just around the corner. We can introduce
>that generalization without adding complexity and there is nothing in
>principle opposed to it.
>
>The interface is very simple, mainly consisting of the functions
>regex(), match(), and sub(), e.g.
>
>foreach (e; match("abracazoo", regex("a[b-e]", "g")))
> writeln(e.pre, e.hit, e.post);
>auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
>
>Two other syntactic options are available:
>
>"abracazoo".match(regex("a[b-e]", "g")))
>"abracazoo".match("a[b-e]", "g")
>
>I could have made match a member of regex:
>
>regex("a[b-e]", "g")).match("abracazoo")
>
>but most regex code I've seen mentions the string first and the regex
>second. So I dropped that idea.
>
>Now, match() is likely to be called very often so I'm considering:
>
>foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
> writeln(e);
>
>In general I'm weary of unwitting operator overloading, but I think this
>case is more justified than others. Thoughts?
>
Please anything but ~. It would be fine, if it didn't follow an array.
It could be 'in' as Bill suggested. Or maybe /, which reminds of sed
and downs:
foreach (e; "abracazoo"/regex("a[b-e]", "g"))
writeln(e);
If you made 'match' a member of Regex, another option would be the
opCall:
regex("a[b-e]", "g")( "abracazoo")
More information about the Digitalmars-d
mailing list