[GSOC idea] enhance regular expressions?
Dmitry Olshansky
dmitry.olsh at gmail.com
Mon Mar 28 14:15:48 PDT 2011
Hello,
(this is a long post, you may skip this introduction)
I was sticking around D's NG for about a year and following it's
advancements very closely. I was busy ruthlessly wasting my spare time
for testing this or that of cool features, converting some personal C++
projects. Maybe it's time to introduce myself: I'm student in the middle
of getting my master's degree in Moscow Institute of Physics and
Technology. My main interests for the last years were C++,
metaprogramming, compiler theory and, lately, robotics and
microcontrollers. I got attracted to D2 as a natural way to reinvest my
C++ knowledge (honestly, C++0x is such a mess) and, in fact, looked
forward to do some hackery on some real-world compiler (that would be
DMD). As sort of warm-up to get better grasp on the language and
real-world interpreters alike I did some work on DMDScript (porting it
to D2 among others) it's available on
http://dsource.org/projects/dmdscript-2. Note that I haven't touched it
in couple of dmd releases, it may need cosmetic fixes.
To summarize, I'm no stranger to regular expressions and parsing and
want to utilize this experience to help community, what follows is my
first attempt at proposal, any comments and ideas are warmly welcome.
--- somewhat informal draft ---
Having regular expression (regex) engine in standard library is totally
expected for any modern language.
The solid and performant implementation of regex is one of the greatest
points of pride (if not of sale).
While concrete results and benchmarks got frequently biased, there are
some general conclusions:
regexes usually came in two flavors: Finite Automation (FA, be it
deterministic or nondeterministic) and backtracking, each of these
with their set of traits.
FA are more stable O(N) in time on input, however they are usually more
limited as implementing features such as backreferences becomes
problematic, also not to mention the time for constructing DFA from pattern.
Backtracking allows easily implementing some interesting features like
backreferencing, lookahead/lookbehind and such, but has a dreadful
exponential time on input in worst case.
Not willing to go deeper into the performance/feature/usage topic,
fairly good picture of various regular expression engines can be found
here: http://lh3lh3.users.sourceforge.net/reb.shtml. I though have now
idea why he claims RE2 is feature-rich.
Now speaking of D, what we have now is essentially an optimized
backtracking with some not implemented features it may very well have
had. Also consider important issue that need proper resolution
http://d.puremagic.com/issues/show_bug.cgi?id=941. This situation begs
for a change.
The proposal consists of three parts:
1) Enhance and bugfix the existing engine in std.regex. Sometime ago I
got accustomed with this engine internals and confident that I can get
it done. That's a short-term solution at best and something I plan to do
in spare time even if it's not considered a compelling GSOC proposal ;)
2) Make another one FA-based from scratch, here things go somewhat
speculative.
I plan to start with NFA, leaving DFA on later. Probably NFA with cached
DFA states as optimization tuning aka memory vs time. Someone brought
http://swtch.com/~rsc/regexp/regexp1.html, it could be a starting point,
I see some optimization opportunities I'd like to try. Supporting
unicode of all flavors is of no question. The API for this engine may
stay the same or it even may be integrated into existing one, making
the type of engine a compile-time parameter with default being
Regex.Backtracking. Suppose:
auto dfa = regex!Regex.NFA("(D|N)?FA","g");
3) When I saw Don's comments on fixing CTFE, I remembered something D
can do _better_ than most others mainstream compiled languages -
metaprogramming. So having a StaticRegex compiled at compile-time
(inevitable calambour) would give us an edge over comparative
implementations. Also I can imagine quite a lot of applications not
requiring *any* regex from non-constant pattern. Major exceptions are
editors with find/replace and certain scripting/parsing toolkits.
It looks quite natural:
...
enum sregNum= StaticRegex!"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)";
...
assert(match("PI: 3.1415926",sregNum).hit == "3.1415926");
The best speed gain most likely would be obtained by generating the
whole DFA at compile time, I believe such possibility was previously
discussed on the NG.
This diversity in kinds of regexes may warrant making regex engine an
interface ( for homogeneous storage and parameter passing) with concrete
implementations being classes.
Another thing to sort out is style: are we sticking with ECMA or
switching to Perl / GNU? Shall we provide also different versions of
syntax?
--
Dmitry Olshansky
More information about the Digitalmars-d
mailing list