Improving std.regex(p)

Ben Hanson Ben.Hanson at tfbplc.co.uk
Sat Jun 19 01:35:53 PDT 2010


== Quote from Ellery Newcomer (ellery-newcomer at utulsa.edu)'s article
> The NFA->DFA algorithms are what I was referring to. My copy of the
> dragon book suggests that DFAs usually have O(n^3) initialization time
> but O(n^2 * 2^n) worst case relative to the number of states. I'd love
> to be shown wrong, though.
I can't prove what the comlexity is for DFA compilation, but instead, I
challenge anyone to show slow compilation times with any DFA compatible regex.
As I don't have some D code yet, you could try the current lexertl code (just
add a single rule). Compile as Release if using Visual C++, as the iterator
debugging kills performance in VC++ 8 or above (one problem I guess wouldn't
exist with a D codebase).

As I say, the only obvious area for me that hammers compilation times is huge
repeats. If you repeat an expression 50,0000 times or something, then yes, you
will get a huge machine and it will take time to construct! :-)

There are papers on tackling this with counting repeats, even for DFAs. If this
area interests anyone, please let me know.

> You don't get a performance hit with DFAs on knarly regexen (actually,
> they would be faster than NFAs), just regexen that take forever to
> compile. I wonder if it would be possible to mechanically detect problem
> cases for DFA compilation and on detection switch over to an NFA engine.

DFA compilation is particularly good for simplifying verbose regexes, even
wihout DFA minimisation. Hybrid DFA/NFA engines have been written in the past
and as far as feaures go, this is probably the ideal way to go. Russ Cox has
some clever techniques too.

Don't forget that grep is (traditionally) implemented with a DFA engine.

Of course, the real irony is that Henry Spencer wrote a much better regex
engine for TCL (a DFA/NFA hybrid as I understand it) which is much better than
his original NFA backtracker and yet everyone holds up the backtracking
algorithm as the way to do things...

That's inertia for you.

Regards,

Ben


More information about the Digitalmars-d mailing list