Improving std.regex(p)

Ben Hanson Ben.Hanson at tfbplc.co.uk
Fri Jun 18 08:21:50 PDT 2010


Hi Ellery,

== Quote from Ellery Newcomer (ellery-newcomer at utulsa.edu)'s article
> I've always understood that dfas have an exponential lower bound
> relative to .. something .. but I don't believe I have ever actually
> used a dfa generator (or written one, sadly enough). What has your
> experience been on dfa compile time?
> I suppose RE2's syntax is a good starting point
> http://code.google.com/p/re2/wiki/Syntax
> I wonder about those unicode character classes, though. Looks like an
> imminent head-on collision with dmd's issues regarding memory management
> at compile time.

I also wondered about this, and originally I just thought, well, you can solve
any performance problem with decent, optimal code! :-) That train of thought
was years ago now, though, and in reality the only way I can really see to
bring DFA compilation to its knees is massive repeats.
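
To put a number on it: bounded repeats are expanded during compilation, and in
the worst case subset construction then explodes. The classic example is
(a|b)*a(a|b){n}, whose minimal DFA needs 2^(n+1) states because it has to
remember the last n+1 chars. A quick from-scratch C++ sketch (nothing to do
with lexertl; all names are mine) that counts the DFA states subset
construction produces:

#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

// NFA for (a|b)*a(a|b){n}: states 0..n+1, start 0, accept n+1.
// State 0 loops on 'a'/'b' and also moves to 1 on 'a';
// states 1..n move to the next state on either char.
static std::uint64_t step(std::uint64_t set, char c, int n)
{
    std::uint64_t out = 0;

    if (set & 1)
    {
        out |= 1;                        // 0 --a/b--> 0

        if (c == 'a') out |= 2;          // 0 --a--> 1
    }

    for (int i = 1; i <= n; ++i)         // i --a/b--> i + 1
        if (set & (1ULL << i)) out |= 1ULL << (i + 1);

    return out;
}

int main()
{
    for (int n = 1; n <= 12; ++n)
    {
        std::map<std::uint64_t, int> seen;  // NFA subset -> DFA state id
        std::vector<std::uint64_t> todo{1}; // start subset {0}

        seen[1] = 0;

        while (!todo.empty())
        {
            std::uint64_t s = todo.back();

            todo.pop_back();

            for (char c : {'a', 'b'})
            {
                std::uint64_t t = step(s, c, n);

                if (seen.insert({t, static_cast<int>(seen.size())}).second)
                    todo.push_back(t);
            }
        }

        std::printf("n = %2d: %zu DFA states\n", n, seen.size());
    }

    return 0;
}

The count doubles every time n goes up by one, which is exactly the kind of
thing a big repeat can trigger.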

DFA also implies a more limited syntax (Andrei hinted at this in his original
post), and there are certain things you just don't need to worry about any
more, such as 'possessive' quantifiers. Traditionally DFA regex has been used
for lexer generators, as they generally don't need some of the more outlandish
'extra' features, so it's possible that a straight-up regex engine would have
more demands put on it. But then again, a lexer generator is dealing with a
list of regexes, not just one! The last time I timed it, lexertl could compile
all of the C++ token rules in well under 0.5 seconds.
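
For reference, driving lexertl with such a list looks roughly like this. I'm
quoting the shape of the documented interface (rules::push, generator::build,
lexertl::lookup) from memory, so treat the exact names as approximate; the API
has changed between versions:

#include "lexertl/generator.hpp"
#include "lexertl/lookup.hpp"
#include <iostream>
#include <string>

int main()
{
    lexertl::rules rules_;
    lexertl::state_machine sm_;

    // A "list of regexes": each token rule gets an id.
    rules_.push("[0-9]+", 1);
    rules_.push("[a-z]+", 2);
    // DFA construction happens once, here.
    lexertl::generator::build(rules_, sm_);

    std::string input_("abc012d3e4");
    lexertl::smatch results_(input_.begin(), input_.end());

    lexertl::lookup(sm_, results_);

    while (results_.id != 0)
    {
        std::cout << "Id: " << results_.id << ", Token: '"
            << results_.str() << "'\n";
        lexertl::lookup(sm_, results_);
    }

    return 0;
}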

There are such things as tagged DFAs (TDFAs), which I would love to know more
about. The Tango regex library uses them for captures, I believe. If anyone
can provide more info on how TDFAs really work, that would be fantastic. All
the literature I've read didn't exactly enlighten me (including Ville
Laurikari's papers).

The full RE2 syntax is way more than I could provide. See
http://www.benhanson.net/lexertl.html for a table of the regex syntax I know
how to support.

On the subject of Unicode chars, this is an interesting topic. The Tango regex
library gets around the storage requirements by keeping a list of ranges. The
approach lexertl takes is to maintain a sorted string of chars. The trick is,
if the string holds more than half of the maximum possible number of chars,
you negate it. This is actually part of the normalisation process and leads to
quicker intersections later on. When it comes to the final machine, lexertl
now slices 'wide chars' into bytes for lookup, keeping the first-stage lookup
table to 256 entries (I can elaborate if anyone wants me to). Of course, if
you are generating code (as seems to be desired), none of this matters very
much.
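
To sketch the negation trick (illustrative code, not lexertl's actual
implementation; I've used a sorted vector of byte values rather than a string
to keep the comparisons unsigned, and 256/128 as the range and half-way point
since the final machine works on bytes anyway):

#include <algorithm>
#include <iterator>
#include <vector>

struct charset
{
    std::vector<unsigned char> chars; // sorted, unique
    bool negated = false;

    // If more than half of the 256 possible byte values are present,
    // store the complement instead and flip the flag.
    void normalise()
    {
        if (chars.size() > 128)
        {
            std::vector<unsigned char> comp;
            std::size_t idx = 0;

            for (int c = 0; c < 256; ++c)
            {
                if (idx < chars.size() && chars[idx] == c)
                    ++idx; // present, so not in the complement
                else
                    comp.push_back(static_cast<unsigned char>(c));
            }

            chars.swap(comp);
            negated = !negated;
        }
    }
};

// The pay-off: intersection reduces to three cheap cases over small,
// sorted sequences.
charset intersect(const charset& lhs, const charset& rhs)
{
    charset r;
    auto out = std::back_inserter(r.chars);

    if (!lhs.negated && !rhs.negated)      // A and B
        std::set_intersection(lhs.chars.begin(), lhs.chars.end(),
            rhs.chars.begin(), rhs.chars.end(), out);
    else if (!lhs.negated && rhs.negated)  // A and ~B == A \ B
        std::set_difference(lhs.chars.begin(), lhs.chars.end(),
            rhs.chars.begin(), rhs.chars.end(), out);
    else if (lhs.negated && !rhs.negated)  // ~A and B == B \ A
        std::set_difference(rhs.chars.begin(), rhs.chars.end(),
            lhs.chars.begin(), lhs.chars.end(), out);
    else                                   // ~A and ~B == ~(A or B)
    {
        std::set_union(lhs.chars.begin(), lhs.chars.end(),
            rhs.chars.begin(), rhs.chars.end(), out);
        r.negated = true;
    }

    return r;
}

(You'd call normalise() on the result again to keep the invariant.)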

Don't forget that if you intersect all the charsets *once* after building your
syntax tree, then when you do transition intersections you are only
intersecting sets of indexes to charset equivalence classes, which is really
quick. For a quick example:

- Say you have two char classes, one is [a-z] and the other is [a].
- You intersect these classes once, splitting them into the disjoint sets
[b-z] and [a].
- So if originally [a-z] = 0 and [a] = 1 (as I said, charsets are enumerated
as they are scanned), then the equivalence index sets are now [a-z] = {0, 1}
and [a] = {1}, because [b-z] = 0 and [a] = 1.

When you generate the DFA, you are then intersecting {0, 1} and {1} rather
than the actual charsets. The sets are obviously really small! :-)
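
Here's a from-scratch sketch of that once-only partitioning (again, not
lexertl's code; note the indexes you get depend on the order the splits
happen, so this version numbers [a] as 0 and [b-z] as 1, but the principle is
the same):

#include <algorithm>
#include <cstdio>
#include <iterator>
#include <set>
#include <vector>

using charset = std::set<char>;

static charset operator&(const charset& lhs, const charset& rhs)
{
    charset r;
    std::set_intersection(lhs.begin(), lhs.end(), rhs.begin(), rhs.end(),
        std::inserter(r, r.end()));
    return r;
}

static charset operator-(const charset& lhs, const charset& rhs)
{
    charset r;
    std::set_difference(lhs.begin(), lhs.end(), rhs.begin(), rhs.end(),
        std::inserter(r, r.end()));
    return r;
}

int main()
{
    charset az;

    for (char c = 'a'; c <= 'z'; ++c) az.insert(c);

    // Charsets enumerated as they are scanned: [a-z] = 0, [a] = 1.
    const std::vector<charset> original{az, {'a'}};

    // Pass 1: split the charsets into disjoint equivalence classes.
    std::vector<charset> classes;

    for (const charset& cs : original)
    {
        charset rest = cs;
        const std::size_t count = classes.size();

        for (std::size_t i = 0; i < count && !rest.empty(); ++i)
        {
            charset both = classes[i] & rest;

            if (both.empty()) continue;

            if (both.size() < classes[i].size())
            {
                // Split an existing class in two.
                classes.push_back(classes[i] - both);
                classes[i] = both;
            }

            rest = rest - both;
        }

        if (!rest.empty()) classes.push_back(rest);
    }

    // Pass 2: map each original charset to the set of class indexes it
    // covers. These small integer sets are what the DFA build
    // intersects from now on.
    for (std::size_t o = 0; o < original.size(); ++o)
    {
        std::printf("charset %zu -> {", o);

        for (std::size_t c = 0; c < classes.size(); ++c)
            if (std::includes(original[o].begin(), original[o].end(),
                    classes[c].begin(), classes[c].end()))
                std::printf(" %zu", c);

        std::printf(" }\n");
    }

    return 0;
}

This prints {0 1} for [a-z] and {0} for [a], matching the example above up to
renumbering.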

Hope that was useful, or at least interesting!

Regards,

Ben

