[GSOC idea] enhance regular expressions?

Mon Mar 28 14:15:48 PDT 2011

Hello,
(this is a long post, you may skip this introduction)
I was sticking around D's NG for about a year and following it's 
advancements very closely. I was busy ruthlessly wasting my spare time 
for testing this or that of cool features,  converting some personal C++ 
projects. Maybe it's time to introduce myself: I'm student in the middle 
of getting my master's degree in Moscow Institute of Physics and 
Technology. My main interests for the last years were C++, 
metaprogramming, compiler theory and, lately, robotics and 
microcontrollers. I got attracted to D2 as a natural way to reinvest my 
C++ knowledge (honestly, C++0x is such a mess) and, in fact, looked 
forward to do some hackery on some real-world compiler (that would be 
DMD). As sort of warm-up to get better grasp on the language and 
real-world interpreters alike I did some work on DMDScript (porting it 
to D2 among others) it's available on 
http://dsource.org/projects/dmdscript-2. Note that I haven't touched it 
in couple of dmd releases, it may need cosmetic fixes.

To summarize, I'm no stranger to regular expressions and parsing and 
want to utilize this experience to help community, what follows is my 
first attempt at proposal, any comments and ideas are warmly welcome.

--- somewhat informal draft ---

Having regular expression (regex) engine in standard library is totally 
expected for any modern language.
The solid and performant implementation of regex is one of the greatest 
points of pride (if not of sale).
While concrete results and benchmarks got frequently biased, there are 
some general conclusions:
regexes usually came in two flavors: Finite Automation (FA, be it 
deterministic  or nondeterministic) and  backtracking, each of these 
with their set of traits.
FA are more stable O(N) in time on input, however they are usually more 
limited as implementing features such as backreferences becomes 
problematic, also not to mention the time for constructing DFA from pattern.
Backtracking allows easily implementing some interesting features like 
backreferencing, lookahead/lookbehind and such, but has a dreadful 
exponential time on input in worst case.
Not willing to go deeper into the performance/feature/usage topic,  
fairly good picture of various regular expression engines can be found 
here:  http://lh3lh3.users.sourceforge.net/reb.shtml. I though have now 
idea why he claims RE2 is feature-rich.

Now speaking of D, what we have now is essentially an optimized 
backtracking with some not implemented features it may very well have 
had. Also consider important issue that need proper resolution 
http://d.puremagic.com/issues/show_bug.cgi?id=941. This situation begs 
for a change.

The proposal consists of three parts:

1) Enhance and bugfix the existing engine in std.regex. Sometime ago I 
got accustomed with this engine internals and confident that I can get 
it done. That's a short-term solution at best and something I plan to do 
in spare time even if it's not considered a compelling GSOC proposal ;)

2)  Make another one FA-based from scratch, here things go somewhat 
speculative.
I plan to start with NFA, leaving DFA on later. Probably NFA with cached 
DFA states as optimization tuning aka memory vs time. Someone brought  
http://swtch.com/~rsc/regexp/regexp1.html, it could be a starting point, 
I see some optimization opportunities I'd like to try. Supporting 
unicode of all flavors is of no question. The API for this engine may 
stay the same or it even  may be integrated into existing one, making 
the type of engine a compile-time parameter with default being 
Regex.Backtracking. Suppose:
auto dfa = regex!Regex.NFA("(D|N)?FA","g");

3) When I saw Don's comments on fixing CTFE, I remembered something D 
can do _better_  than most others mainstream compiled languages - 
metaprogramming. So having a StaticRegex compiled at compile-time 
(inevitable calambour) would give us an  edge over comparative 
implementations. Also I can imagine quite a lot of applications not 
requiring *any* regex from non-constant pattern. Major exceptions are 
editors with find/replace and certain scripting/parsing toolkits.
It looks quite natural:
...
enum sregNum= StaticRegex!"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)";
...
assert(match("PI: 3.1415926",sregNum).hit == "3.1415926");

The best speed gain most likely would be obtained by generating the 
whole DFA at compile time, I believe such possibility  was previously 
discussed on the NG.

This diversity in kinds of regexes may warrant making regex engine an 
interface ( for homogeneous storage and parameter passing) with concrete 
implementations being classes.

Another thing to sort out is style: are we sticking with ECMA or 
switching to Perl / GNU? Shall we provide also different versions of  
syntax?

-- 
Dmitry Olshansky