[GSOC idea] enhance regular expressions?

Thu Mar 31 11:16:05 PDT 2011

Dmitry Olshansky <dmitry.olsh at gmail.com> writes:
> --- somewhat informal draft ---
>

Hi Dmitry,

It's good to know that there is interest in improving regular
expressions in the D standard library. I've run into a number of
problems using std.regex myself, so I'm keen so see either fixes for
std.regex or an improved replacement.

I've been working on a regular expression library myself based around
Russ Cox article at http://swtch.com/~rsc/regexp/regexp2.html and the
std.regex range interface. It includes a backtracking and Thompson
engine. If you are interested it is on github:

https://github.com/PeteChadwick/pcregexeng

The ddoc documentation is here:
http://petechadwick.github.com/pcregexeng/

> Having regular expression (regex) engine in standard library is
> totally expected for any modern language.
> The solid and performant implementation of regex is one of the
> greatest points of pride (if not of sale).

Couldn't agree more.

> While concrete results and benchmarks got frequently biased, there are
> some general conclusions:
> regexes usually came in two flavors: Finite Automation (FA, be it
> deterministic  or nondeterministic) and  backtracking, each of these
> with their set of traits.
> FA are more stable O(N) in time on input, however they are usually
> more limited as implementing features such as backreferences becomes
> problematic, also not to mention the time for constructing DFA from
> pattern.

For an NFA, the complexity of the regular expression also affects the
execution time as there will probably me more states to process per
character. There is also more storage and copying required for
captures. I haven't looked closely at using a DFA, and I'm not sure how
captures would work with one.

> Backtracking allows easily implementing some interesting features like
> backreferencing, lookahead/lookbehind and such, but has a dreadful
> exponential time on input in worst case.
> Not willing to go deeper into the performance/feature/usage topic,
> fairly good picture of various regular expression engines can be found
> here:  http://lh3lh3.users.sourceforge.net/reb.shtml. I though have
> now idea why he claims RE2 is feature-rich.
>
> Now speaking of D, what we have now is essentially an optimized
> backtracking with some not implemented features it may very well have
> had. Also consider important issue that need proper resolution
> http://d.puremagic.com/issues/show_bug.cgi?id=941. This situation begs
> for a change.
>
> The proposal consists of three parts:
>
> 1) Enhance and bugfix the existing engine in std.regex. Sometime ago I
> got accustomed with this engine internals and confident that I can get
> it done. That's a short-term solution at best and something I plan to
> do in spare time even if it's not considered a compelling GSOC
> proposal ;)

I would really like to see this.

> 2)  Make another one FA-based from scratch, here things go somewhat
> speculative.
> I plan to start with NFA, leaving DFA on later. Probably NFA with
> cached DFA states as optimization tuning aka memory vs time. Someone
> brought  http://swtch.com/~rsc/regexp/regexp1.html, it could be a
> starting point, I see some optimization opportunities I'd like to
> try. Supporting unicode of all flavors is of no question. The API for
> this engine may stay the same or it even  may be integrated into
> existing one, making the type of engine a compile-time parameter with
> default being Regex.Backtracking. Suppose:
> auto dfa = regex!Regex.NFA("(D|N)?FA","g");

I think the existing range based API is very good. I also like the idea
of the regular expression engine being a compile time parameter.

> 3) When I saw Don's comments on fixing CTFE, I remembered something D
> can do _better_  than most others mainstream compiled languages - 
> metaprogramming. So having a StaticRegex compiled at compile-time
> (inevitable calambour) would give us an  edge over comparative
> implementations. Also I can imagine quite a lot of applications not
> requiring *any* regex from non-constant pattern. Major exceptions are
> editors with find/replace and certain scripting/parsing toolkits.
> It looks quite natural:
> ...
> enum sregNum= StaticRegex!"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)";
> ...
> assert(match("PI: 3.1415926",sregNum).hit == "3.1415926");
>
> The best speed gain most likely would be obtained by generating the
> whole DFA at compile time, I believe such possibility  was previously
> discussed on the NG.

This would be a real nice feature.

> This diversity in kinds of regexes may warrant making regex engine an
> interface ( for homogeneous storage and parameter passing) with
> concrete implementations being classes.
>
> Another thing to sort out is style: are we sticking with ECMA or
> switching to Perl / GNU? Shall we provide also different versions of
> syntax?