Is str ~ regex the root of all evil, or the leaf of all good?
Denis Koroskin
2korden at gmail.com
Thu Feb 19 00:30:11 PST 2009
On Thu, 19 Feb 2009 08:35:20 +0300, Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org> wrote:
> I'm almost done rewriting the regular expression engine, and some pretty
> interesting things have transpired.
>
> First, I separated the engine into two parts, one that is the actual
> regular expression engine, and the other that is the state of the match
> with some particular input. The previous code combined the two into a
> huge class. The engine (written by Walter) translates the regex string
> into a bytecode-compiled form. Given that there is a deterministic
> correspondence between the regex string and the bytecode, the Regex
> engine object is in fact invariant and cached by the implementation.
> Caching makes for significant time savings even if e.g. the user
> repeatedly creates a regular expression engine in a loop.
>
> In contrast, the match state depends on the input string. I defined it
> to implement the range interface, so you can either inspect it directly
> or iterate it for all matches (if the "g" option was passed to the
> engine).
>
> The new codebase works with char, wchar, and dchar and any random-access
> range as input (forward ranges to come, and at some point in the future
> input ranges as well). In spite of the added flexibility, the code size
> has shrunk from 3396 lines to 2912 lines. I plan to add support for
> binary data (e.g. ubyte - handling binary file formats can benefit a LOT
> from regexes) and also, probably unprecedented, support for arbitrary
> types such as integers, floating point numbers, structs, what have you.
> any type that supports comparison and ranges is a good candidate for
> regular expression matching. I'm not sure how regular expression
> matching can be harnessed e.g. over arrays of int, but I suspect some
> pretty cool applications are just around the corner. We can introduce
> that generalization without adding complexity and there is nothing in
> principle opposed to it.
>
> The interface is very simple, mainly consisting of the functions
> regex(), match(), and sub(), e.g.
>
> foreach (e; match("abracazoo", regex("a[b-e]", "g")))
> writeln(e.pre, e.hit, e.post);
> auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
>
> Two other syntactic options are available:
>
> "abracazoo".match(regex("a[b-e]", "g")))
> "abracazoo".match("a[b-e]", "g")
>
> I could have made match a member of regex:
>
> regex("a[b-e]", "g")).match("abracazoo")
>
> but most regex code I've seen mentions the string first and the regex
> second. So I dropped that idea.
>
> Now, match() is likely to be called very often so I'm considering:
>
> foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
> writeln(e);
>
> In general I'm weary of unwitting operator overloading, but I think this
> case is more justified than others. Thoughts?
>
>
> Andrei
"abracazoo".match("a[b-e]", "g") is as short as "abracazoo" ~ regex("a[b-e]", "g") but doesn't existing conventions. I prefer it over '~' version. In is also fine (both ways).
More information about the Digitalmars-d
mailing list