Bad performance of simple regular expression - why??
Benji Smith
dlanguage at benjismith.net
Wed Feb 7 18:28:16 PST 2007
Walter Bright wrote:
> Tomas Lindquist Olsen wrote:
>> http://swtch.com/~rsc/regexp/regexp1.html
>
> That looks pretty cool. Anyone care to take a stab at implementing this
> in std.regexp? It could be done by examining the generated bytecode and
> attempting to convert it to an NFA. If that succeeds, then great, do the
> NFA. If it fails, then fallback to the original method.
A little while back, at my job, I implemented a regex parser (in Java)
that builds an NFA and then transforms it into a DFA, re-serializing it
into a regex for consumption by any old regex engine.
For example, this regex assumes NFA performance characteristics when
compiled by the (java, perl, python, etc) regex engine:
(?:cat|car|cut|cute|cumbersome)
This regex is semantically identical (meaning that it will produce
exactly the same set of matches as the simpler expression), but it
assumes DFA performance characteristics in the regex engine:
c(?:a[tr]|u(?:te?|umbersome))
Since I needed my regular expressions to execute in an ordinary regex
engine, I couldn't tweak the engine internals, but I could trick the
regex engine into being a DFA.
The major caveat to this approach is that parenthetical groups get
jumbled. If you use parentheses to capture text (rather than just to
match it), the numbering of your backreferences is going to get all
screwed up. When I use this approach, I almost always use noncapturing
parentheses (?: ... ) rather than ordinary parens.
However, if you had control of the regex engine itself (rather than just
feeding tweaked expressions into an ordinary engine), you could easily
implement DFA behavior characteristics without changing the capturing
semantics.
I'd implement this myself for D, but I'm afraid my NDA with my employer
would probably prevent me from doing so. The theory is straightforward,
though, so it should be do-able by anyone with the appropriate RE
experience.
--Benji Smith
More information about the Digitalmars-d
mailing list