Is str ~ regex the root of all evil, or the leaf of all good?
Bill Baxter
wbaxter at gmail.com
Wed Feb 18 21:50:06 PST 2009
On Thu, Feb 19, 2009 at 2:35 PM, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> wrote:
> I'm almost done rewriting the regular expression engine, and some pretty
> interesting things have transpired.
>
> First, I separated the engine into two parts, one that is the actual regular
> expression engine, and the other that is the state of the match with some
> particular input. The previous code combined the two into a huge class. The
> engine (written by Walter) translates the regex string into a
> bytecode-compiled form. Given that there is a deterministic correspondence
> between the regex string and the bytecode, the Regex engine object is in
> fact invariant and cached by the implementation. Caching makes for
> significant time savings even if e.g. the user repeatedly creates a regular
> expression engine in a loop.
>
> In contrast, the match state depends on the input string. I defined it to
> implement the range interface, so you can either inspect it directly or
> iterate it for all matches (if the "g" option was passed to the engine).
>
> The new codebase works with char, wchar, and dchar and any random-access
> range as input (forward ranges to come, and at some point in the future
> input ranges as well). In spite of the added flexibility, the code size has
> shrunk from 3396 lines to 2912 lines. I plan to add support for binary data
> (e.g. ubyte - handling binary file formats can benefit a LOT from regexes)
> and also, probably unprecedented, support for arbitrary types such as
> integers, floating point numbers, structs, what have you. Any type that
> supports comparison and ranges is a good candidate for regular expression
> matching. I'm not sure how regular expression matching can be harnessed e.g.
> over arrays of int, but I suspect some pretty cool applications are just
> around the corner. We can introduce that generalization without adding
> complexity and there is nothing in principle opposed to it.
>
> The interface is very simple, mainly consisting of the functions regex(),
> match(), and sub(), e.g.
>
> foreach (e; match("abracazoo", regex("a[b-e]", "g")))
>     writeln(e.pre, e.hit, e.post);
> auto s = sub("abracazoo", regex("a([b-e])", "g"), "A$1");
>
> Two other syntactic options are available:
>
> "abracazoo".match(regex("a[b-e]", "g"))
> "abracazoo".match("a[b-e]", "g")
>
> I could have made match a member of regex:
>
> regex("a[b-e]", "g").match("abracazoo")
>
> but most regex code I've seen mentions the string first and the regex
> second. So I dropped that idea.
>
> Now, match() is likely to be called very often so I'm considering:
>
> foreach (e; "abracazoo" ~ regex("a[b-e]", "g"))
>     writeln(e);
>
> In general I'm wary of unwitting operator overloading, but I think this
> case is more justified than others. Thoughts?

No. Perl spells matching =~, but in D ~ means concatenation. This
special case is not special enough to warrant breaking D's convention,
in my opinion. It also breaks D's convention that operators have an
inherent meaning which shouldn't be subverted to do unrelated things.

What about turning it around and using 'in' though?

    foreach (e; regex("a[b-e]", "g") in "abracazoo")
        writeln(e);

The charter for "in" isn't quite as focused as that for ~, and anyway
you could view this as finding instances of the regular expression
"in" the string.
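
For what it's worth, here is a rough sketch of how the 'in' spelling could be wired up. It assumes the std.regex API as it exists in current Phobos (regex, matchAll), not the engine described in the quoted message. Pattern and hits are hypothetical names made up for illustration; neither is part of any library.

```d
import std.regex;
import std.stdio;

// Hypothetical wrapper type; nothing like it exists in Phobos.
struct Pattern
{
    Regex!char re;

    // `pattern in text` is rewritten to pattern.opBinary!"in"(text).
    auto opBinary(string op : "in")(string text)
    {
        // matchAll yields a range of Captures, one per match.
        return matchAll(text, re);
    }
}

// Hypothetical helper: collect every matched substring.
string[] hits(string text, string pattern)
{
    auto p = Pattern(regex(pattern));
    string[] result;
    foreach (m; p in text)
        result ~= m.hit;
    return result;
}

void main()
{
    auto p = Pattern(regex("a[b-e]"));
    foreach (m; p in "abracazoo")
        writeln(m.pre, "|", m.hit, "|", m.post);

    // ~ keeps its usual meaning: concatenation.
    writeln("abra" ~ "cazoo");
}
```

The point of the sketch is only that 'in' can carry the match without touching ~, so string concatenation and array appending (as in hits above) keep reading the way D programmers expect.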
--bb