Improving std.regex(p)
Ellery Newcomer
ellery-newcomer at utulsa.edu
Sat Jun 19 12:47:58 PDT 2010
On 06/17/2010 11:44 PM, Andrei Alexandrescu wrote:
> There are currently two regexen in the standard library. The older one,
> std.regexp, is time-tested but only works with UTF8 and has a clunkier
> API. The newer one, std.regex, is newer and isolates the engine from the
> matches (and therefore can reuse and cache engines easier), and supports
> all character widths. But it's less tested and doesn't have that great
> of an interface because it pretty much inherits the existing one.
>
> I wish to improve regex handling in Phobos. The most important
> improvement is not in the interface - it's in the engine. The current
> engine is adequate but nothing to write home about, and for simple
> regexen is markedly slower than equivalent hand-written code (e.g.
> matching whitespace). One great opportunity would be for D to leverage
> its uncanny compile-time evaluation abilities and offer a regex that
> parses the pattern during compilation:
>
> foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... }
>
> Such a static regex could be simpler than a full-blown regex with
> captures and backreferences etc., but it would have guaranteed
> performance (e.g. it would be an automaton instead of a backtracking
> engine) and would be darn fast because it would generate custom code for
> each regex pattern.
>
> See related work:
>
> http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html
>
>
> If we get as far as implementing what RE2 can do with compile-time
> evaluation, people will definitely notice.
>
> If there's anyone who'd want to tackle such a project (for Phobos or
> not), I highly encourage you to do so.
>
>
> Andrei
Generally I think D's CT capabilities have a way to go yet before this
would be worth tackling. E.g. how do you build a parse tree if pointers
and classes aren't allowed in CT code?
But, predicated on the assumption that it will happen relatively soon..
What do we want our regex language to look like?
this: http://digitalmars.com/ctg/regular.html
is okay, but I like the idea of ascii and unicode character classes as
per RE2's syntax, and I don't like the idea of an overloaded \n.
(I assume we're talking about strings as bidirectional unicode ranges)
and I would like to see the language grammar explicitly written out
somewhere.
and I like the idea of capture groups. Thinking of something along the
lines of
import std.regex;
int i;
double d;
ctcapture!r"(?<i>\d+): (?<d>\d+\.\d+)"( "10: 100.5"); //this will
probably have to be inside a mixin statement
assert(i == 10);
assert(d == 100.5); //grumph, floating point..
for that matter, why not some predefined subexpressions corresponding to
D's integer literals, et al? And available inside the regex?
ctcapture!r"(?<i>{:IntegerLiteral:}): (?<d>{:FloatLiteral:})" // Yeah, I
know. You come up with a better syntax
("10: 0xEA.A1p-2");
Okay, I'm done dreaming. I don't know how much of the above can be done
within the constraints of an FA. How much sense does regex over unicode
make? Do others have ideas for features for regex?
More information about the Digitalmars-d
mailing list