Improving std.regex(p)

Ben Hanson Ben.Hanson at tfbplc.co.uk
Sat Jun 19 13:33:05 PDT 2010


== Quote from Ellery Newcomer (ellery-newcomer at utulsa.edu)'s article
> On 06/19/2010 03:35 AM, Ben Hanson wrote:
> > I can't prove what the complexity is for DFA compilation, but instead, I
> > challenge anyone to show slow compilation times with any DFA compatible
> > regex. As I don't have any D code yet, you could try the current lexertl
> > code (just add a single rule). Compile as Release if using Visual C++, as
> > the iterator debugging kills performance in VC++ 8 or above (one problem
> > I guess wouldn't exist with a D codebase).
> >
> > As I say, the only obvious area for me that hammers compilation times is
> > huge repeats. If you repeat an expression 50,000 times or something, then
> > yes, you will get a huge machine and it will take time to construct! :-)
> and unicode?
The latest lexertl takes the following strategy (there's a rough sketch after
this list):

- If you're building a lexer using 8 bit chars, then no problem: two phase
lookup, where a first lookup maps all 256 chars to an equivalence class index,
and that index then locates the correct column in the DFA.
- If you're using 'wide characters', then keep the 256 entry lookup table,
slice each 'wide character' into 8 bit chunks and use those chunks as the
charsets.
- If you want to generate code, you can generate tables using a first phase
lookup of MAX_CHARS of the 'wide character'. You can then convert this back to
a character based state machine, dealing with characters again instead of
equivalence classes.
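
In rough C++ (the names and sizes here are mine for illustration, not
lexertl's actual internals), the two phase lookup amounts to:

    #include <cstddef>

    enum { MAX_STATES = 64, MAX_CLASSES = 256 }; // sizes picked arbitrarily

    unsigned char ec[256]; // char -> equivalence class index
    // DFA transition table: state x equivalence class -> next state
    std::size_t dfa[MAX_STATES][MAX_CLASSES];

    std::size_t next(std::size_t state, unsigned char c)
    {
        return dfa[state][ec[c]];
    }

    // A 16 bit 'wide character' is sliced into two 8 bit chunks, so
    // the same 256 entry first phase table still works:
    std::size_t next_wide(std::size_t state, unsigned short wc)
    {
        state = next(state, static_cast<unsigned char>(wc >> 8));
        return next(state, static_cast<unsigned char>(wc & 0xff));
    }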

One of the enhancements I want to make is the ability to go straight to a char
based state machine, avoiding the huge first phase lookup.

Now, as far as huge Unicode ranges go, this shouldn't be a problem when you're
generating code. You can use an if statement with an upper and lower bound in
those cases, and only use a switch statement when the code points are
disparate, etc.
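
For example, the generated code for one state could look roughly like this
(hand-written equivalent; the state numbers and code points are invented):

    // Returns true if c has a transition out of this state.
    bool state_1(unsigned int c, int& next_state)
    {
        // Contiguous range: one bounds check covers the whole CJK
        // Unified Ideographs block...
        if (c >= 0x4e00 && c <= 0x9fff)
        {
            next_state = 2;
            return true;
        }

        // ...disparate code points get a switch:
        switch (c)
        {
        case 0x0041: case 0x00c5: case 0x212b:
            next_state = 3;
            return true;
        default:
            return false;
        }
    }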

If the technique of storing all those chars in a string ends up being too
expensive on RAM (even with the auto negation trick), then there's always the
method used by Tango Regex, which is a list of ranges (this probably works
quite well with garbage collection - remember I designed for C++).
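
A range list looks something like this (illustrative only, not Tango's actual
layout):

    #include <cstddef>
    #include <vector>

    struct CharRange
    {
        unsigned int first;
        unsigned int last; // inclusive
    };

    // A charset as a list of ranges: tiny compared to storing every
    // member char over the whole Unicode space.
    bool in_set(const std::vector<CharRange>& set, unsigned int c)
    {
        for (std::size_t i = 0; i < set.size(); ++i)
        {
            if (c >= set[i].first && c <= set[i].last)
                return true;
        }

        return false;
    }

If the list is kept sorted, a binary search could replace the linear scan.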

Let me know if any of that isn't clear. Basically it's not a huge problem: you
can use various techniques to compress the data sensibly and efficiently.
> >
> > There are papers on tackling this with counting repeats, even for DFAs.
> > If this area interests anyone, please let me know.
> it does
I'll have to dig out any references I have! I don't have them to hand right now.
> >
> > DFA compilation is particularly good for simplifying verbose regexes, even
> > without DFA minimisation. Hybrid DFA/NFA engines have been written in the
> > past and as far as features go, this is probably the ideal way to go. Russ
> > Cox has some clever techniques too.
> interesting
Russ (from memory) uses competing 'threads' in his matcher
(http://swtch.com/~rsc/regexp/regexp2.html);
http://swtch.com/~rsc/regexp/regexp3.html discusses RE2.
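
The core of the idea, compressed into a toy (this is my paraphrase of those
articles, not RE2's actual code):

    #include <set>
    #include <string>
    #include <vector>

    struct Inst
    {
        enum Op { Char, Split, Jmp, Match } op;
        char c;   // for Char
        int x, y; // targets for Split/Jmp
    };

    // Add a 'thread' (just a program counter) to a list, following
    // Split/Jmp instructions immediately.
    void add(const std::vector<Inst>& prog, std::set<int>& list, int pc)
    {
        if (!list.insert(pc).second) return; // already queued

        if (prog[pc].op == Inst::Split)
        {
            add(prog, list, prog[pc].x);
            add(prog, list, prog[pc].y);
        }
        else if (prog[pc].op == Inst::Jmp)
        {
            add(prog, list, prog[pc].x);
        }
    }

    // Run all threads over the input in lock-step: no backtracking,
    // so the worst case is O(program size * input length).
    bool run(const std::vector<Inst>& prog, const std::string& s)
    {
        std::set<int> clist, nlist;

        add(prog, clist, 0);

        for (std::size_t i = 0; ; ++i)
        {
            for (std::set<int>::const_iterator it = clist.begin();
                it != clist.end(); ++it)
            {
                const Inst& in = prog[*it];

                if (in.op == Inst::Match) return true; // a thread won

                if (in.op == Inst::Char && i < s.size() && s[i] == in.c)
                    add(prog, nlist, *it + 1); // thread survives
            }

            if (i == s.size()) return false; // input exhausted

            clist.swap(nlist);
            nlist.clear();
        }
    }

    // e.g. "a|b" compiles to:
    //   0: Split 1, 3   1: Char 'a'   2: Jmp 4   3: Char 'b'   4: Match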
> >
> > Don't forget that grep is (traditionally) implemented with a DFA engine.
> I didn't know that
A lot of automata theory appears to have been forgotten, as noted by Russ Cox
(amongst others). Lua has an interesting pattern matcher, as it goes more in
the formal grammar direction to (for example) match balanced parentheses etc.
Another thing I'd like to do (no doubt time will not allow!) is to extend
Notepad RE (http://www.codeproject.com/KB/recipes/notepadre.aspx) to offer the
Lua pattern matcher as an option.
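
Lua's %bxy item matches a balanced pair of delimiters, which a classical
regex can't express at all. The idea is easy to hand-roll (a hypothetical
helper, just to show what the matcher is doing):

    #include <cstddef>
    #include <string>

    // Returns the index one past the close delimiter that balances
    // the open delimiter at pos, or npos if there isn't one.
    std::size_t match_balanced(const std::string& s, std::size_t pos,
        char open, char close)
    {
        if (pos >= s.size() || s[pos] != open) return std::string::npos;

        int depth = 1;

        for (std::size_t i = pos + 1; i < s.size(); ++i)
        {
            if (s[i] == open) ++depth;
            else if (s[i] == close && --depth == 0) return i + 1;
        }

        return std::string::npos; // unbalanced
    }

    // match_balanced("f(a(b)c)d", 1, '(', ')') == 8, the same span
    // Lua's "%b()" finds at that offset.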
> > Of course, the real irony is that Henry Spencer wrote a much better regex
> > engine for Tcl (a DFA/NFA hybrid as I understand it) than his original
> > NFA backtracker, and yet everyone holds up the backtracking algorithm as
> > the way to do things...
> >
> > That's inertia for you.
> >
> > Regards,
> >
> > Ben
Regards,

Ben

