Bad performance of simple regular expression - why??

Benji Smith dlanguage at benjismith.net
Wed Feb 7 18:28:16 PST 2007


Walter Bright wrote:
> Tomas Lindquist Olsen wrote:
>> http://swtch.com/~rsc/regexp/regexp1.html
> 
> That looks pretty cool. Anyone care to take a stab at implementing this 
> in std.regexp? It could be done by examining the generated bytecode and 
> attempting to convert it to an NFA. If that succeeds, then great, do the 
> NFA. If it fails, then fallback to the original method.

A little while back, at my job, I implemented a regex parser (in Java) 
that builds an NFA and then transforms it into a DFA, re-serializing it 
into a regex for consumption by any old regex engine.

For example, this regex assumes NFA performance characteristics when 
compiled by the (java, perl, python, etc) regex engine:

   (?:cat|car|cut|cute|cumbersome)

This regex is semantically identical (meaning that it will produce 
exactly the same set of matches as the simpler expression), but it 
assumes DFA performance characteristics in the regex engine:

   c(?:a[tr]|u(?:te?|umbersome))

Since I needed my regular expressions to execute in an ordinary regex 
engine, I couldn't tweak the engine internals, but I could trick the 
regex engine into being a DFA.

The major caveat to this approach is that parenthetical groups get 
jumbled. If you use parentheses to capture text (rather than just to 
match it), the numbering of your backreferences is going to get all 
screwed up. When I use this approach, I almost always use noncapturing 
parentheses (?: ... ) rather than ordinary parens.

However, if you had control of the regex engine itself (rather than just 
feeding tweaked expressions into an ordinary engine), you could easily 
implement DFA behavior characteristics without changing the capturing 
semantics.

I'd implement this myself for D, but I'm afraid my NDA with my employer 
would probably prevent me from doing so. The theory is straightforward, 
though, so it should be do-able by anyone with the appropriate RE 
experience.

--Benji Smith



More information about the Digitalmars-d mailing list