[GSOC idea] enhance regular expressions?

Dmitry Olshansky dmitry.olsh at gmail.com
Thu Mar 31 14:05:28 PDT 2011


On 31.03.2011 22:16, petevik38 at yahoo.com.au wrote:
> Dmitry Olshansky<dmitry.olsh at gmail.com>  writes:
>> --- somewhat informal draft ---
>>
> Hi Dmitry,
>
> It's good to know that there is interest in improving regular
> expressions in the D standard library. I've run into a number of
> problems using std.regex myself, so I'm keen so see either fixes for
> std.regex or an improved replacement.
Thanks for the interest.
Indeed there are problems, though the engine itself is quite capable and 
optimized, I think it could be fixed.
> I've been working on a regular expression library myself based around
> Russ Cox article at http://swtch.com/~rsc/regexp/regexp2.html and the
> std.regex range interface. It includes a backtracking and Thompson
> engine. If you are interested it is on github:
>
> https://github.com/PeteChadwick/pcregexeng
>
> The ddoc documentation is here:
> http://petechadwick.github.com/pcregexeng/
>
Wow, I'll definitely look into it.
Overall it seems solid even though incomplete. From glance it seems that 
parsing a pattern is somewhat convulted, but I'll defer any critics or 
decisions until I looked in closer detail.
I see you also using the test from std.regex, sadly it's somewhat broken 
(the one with test vectors).
All it's checking proves that *there is a match*, not *what* was matched.

Anyway, I'm gathering what I have in a way of fixes as my first pull 
request, wish me luck ;)
>> Having regular expression (regex) engine in standard library is
>> totally expected for any modern language.
>> The solid and performant implementation of regex is one of the
>> greatest points of pride (if not of sale).
> Couldn't agree more.
>
>> While concrete results and benchmarks got frequently biased, there are
>> some general conclusions:
>> regexes usually came in two flavors: Finite Automation (FA, be it
>> deterministic  or nondeterministic) and  backtracking, each of these
>> with their set of traits.
>> FA are more stable O(N) in time on input, however they are usually
>> more limited as implementing features such as backreferences becomes
>> problematic, also not to mention the time for constructing DFA from
>> pattern.
> For an NFA, the complexity of the regular expression also affects the
> execution time as there will probably me more states to process per
> character. There is also more storage and copying required for
> captures. I haven't looked closely at using a DFA, and I'm not sure how
> captures would work with one.
>
Right, for NFA the execution time in worst case O(n*m) n - input, m - 
pattern length.
As for DFA, it seems the process of construction full DFA eagerly from 
NFA would grab a lot of ram & time O(2^^m) in worst case, which makes it 
useful only on certain amounts of input, hence the idea to compute it at 
compile time.

>> Backtracking allows easily implementing some interesting features like
>> backreferencing, lookahead/lookbehind and such, but has a dreadful
>> exponential time on input in worst case.
>> Not willing to go deeper into the performance/feature/usage topic,
>> fairly good picture of various regular expression engines can be found
>> here:  http://lh3lh3.users.sourceforge.net/reb.shtml. I though have
>> now idea why he claims RE2 is feature-rich.
>>
>> Now speaking of D, what we have now is essentially an optimized
>> backtracking with some not implemented features it may very well have
>> had. Also consider important issue that need proper resolution
>> http://d.puremagic.com/issues/show_bug.cgi?id=941. This situation begs
>> for a change.
>>
>> The proposal consists of three parts:
>>
>> 1) Enhance and bugfix the existing engine in std.regex. Sometime ago I
>> got accustomed with this engine internals and confident that I can get
>> it done. That's a short-term solution at best and something I plan to
>> do in spare time even if it's not considered a compelling GSOC
>> proposal ;)
> I would really like to see this.
>
>> 2)  Make another one FA-based from scratch, here things go somewhat
>> speculative.
>> I plan to start with NFA, leaving DFA on later. Probably NFA with
>> cached DFA states as optimization tuning aka memory vs time. Someone
>> brought  http://swtch.com/~rsc/regexp/regexp1.html, it could be a
>> starting point, I see some optimization opportunities I'd like to
>> try. Supporting unicode of all flavors is of no question. The API for
>> this engine may stay the same or it even  may be integrated into
>> existing one, making the type of engine a compile-time parameter with
>> default being Regex.Backtracking. Suppose:
>> auto dfa = regex!Regex.NFA("(D|N)?FA","g");
> I think the existing range based API is very good. I also like the idea
> of the regular expression engine being a compile time parameter.
>
>> 3) When I saw Don's comments on fixing CTFE, I remembered something D
>> can do _better_  than most others mainstream compiled languages -
>> metaprogramming. So having a StaticRegex compiled at compile-time
>> (inevitable calambour) would give us an  edge over comparative
>> implementations. Also I can imagine quite a lot of applications not
>> requiring *any* regex from non-constant pattern. Major exceptions are
>> editors with find/replace and certain scripting/parsing toolkits.
>> It looks quite natural:
>> ...
>> enum sregNum= StaticRegex!"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)";
>> ...
>> assert(match("PI: 3.1415926",sregNum).hit == "3.1415926");
>>
>> The best speed gain most likely would be obtained by generating the
>> whole DFA at compile time, I believe such possibility  was previously
>> discussed on the NG.
> This would be a real nice feature.
>
>> This diversity in kinds of regexes may warrant making regex engine an
>> interface ( for homogeneous storage and parameter passing) with
>> concrete implementations being classes.
>>
>> Another thing to sort out is style: are we sticking with ECMA or
>> switching to Perl / GNU? Shall we provide also different versions of
>> syntax?


-- 
Dmitry Olshansky



More information about the Digitalmars-d mailing list