Improving std.regex(p)

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Thu Jun 17 21:44:03 PDT 2010


There are currently two regexen in the standard library. The older one, 
std.regexp, is time-tested but only works with UTF-8 and has a clunkier 
API. The newer one, std.regex, isolates the engine from the matches (and 
can therefore reuse and cache engines more easily) and supports all 
character widths. But it's less tested, and its interface isn't that 
great either because it pretty much inherits the existing one.

I wish to improve regex handling in Phobos. The most important 
improvement is not in the interface - it's in the engine. The current 
engine is adequate but nothing to write home about, and for simple 
regexen is markedly slower than equivalent hand-written code (e.g. 
matching whitespace). One great opportunity would be for D to leverage 
its uncanny compile-time evaluation abilities and offer a regex that 
parses the pattern during compilation:

foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... }

Such a static regex could be simpler than a full-blown regex with 
captures, backreferences, etc., but it would have guaranteed 
performance (e.g. it would be an automaton that runs in time linear in 
the input instead of a backtracking engine) and would be darn fast 
because it would generate custom code for each regex pattern.
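
To make "custom code for each regex pattern" a bit more concrete, here 
is a rough sketch of the kind of code a static regex might expand to for 
the pattern ,[ \t\r]* used above. It is written by hand, not generated; 
the names matchCommaBlanks and splitOnCommaBlanks are made up for 
illustration and exist nowhere in Phobos. It also assumes plain ASCII 
delimiters and returns an array rather than a lazy range, just to keep 
the sketch short.

```d
import std.stdio : writeln;

// Hypothetical hand-expanded matcher for the pattern `,[ \t\r]*`.
// Returns the length of the match anchored at input[0], or 0 if the
// pattern does not match there (any real match is at least 1 long).
size_t matchCommaBlanks(const(char)[] input)
{
    // State 0: the pattern must start with a literal comma.
    if (input.length == 0 || input[0] != ',')
        return 0;

    // State 1: consume zero or more blanks; no backtracking is needed.
    size_t i = 1;
    while (i < input.length &&
           (input[i] == ' ' || input[i] == '\t' || input[i] == '\r'))
        ++i;
    return i;
}

// Splits `line` on every match of the pattern in a single
// left-to-right pass over the input.
string[] splitOnCommaBlanks(string line)
{
    string[] fields;
    size_t fieldStart = 0;
    size_t i = 0;
    while (i < line.length)
    {
        immutable n = matchCommaBlanks(line[i .. $]);
        if (n == 0)
        {
            ++i;            // not a delimiter, keep scanning
            continue;
        }
        fields ~= line[fieldStart .. i];
        i += n;             // skip the whole delimiter
        fieldStart = i;
    }
    fields ~= line[fieldStart .. $];   // trailing field
    return fields;
}

void main()
{
    writeln(splitOnCommaBlanks("a, b,\tc,d"));
}
```

Each input character is examined a bounded number of times, which is 
where the guaranteed performance would come from. The interesting part 
of the project is having the compiler produce such code from the pattern 
string instead of having a person write it.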

See related work:

http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html

If we get as far as implementing, through compile-time evaluation, what 
RE2 can do, people will definitely notice.

If there's anyone who'd want to tackle such a project (for Phobos or 
not), I highly encourage you to do so.


Andrei

