Fast regular expressions

Jascha Wetzel "[firstname]" at mainia.de
Sat Apr 14 17:29:11 PDT 2007


Here is an alpha release of a regular expression engine i'm working on:
http://mainia.de/tdfaregexp-0.1-alpha.zip
(artistic license 2.0)

It implements a slightly modified version of Ville Laurikari's tagged
NFA method that allows for regular expression parsing in linear time as
opposed to the very common backtracking method that requires exponential
time (worst case).

As of now, the engine supports greedy and non-greedy operators,
submatches and character classes. It compiles a regexp to a TDFA from
which it can either be applied directly to an input string or compiled
to D code.
Supported operators: . | ( ) ? * + ?? *? +?

The regexp-to-D compilation cannot be done at compile-time, though,
because the required data structures are too complex to be represented
in tuples or strings.

Compile regexp.d and run it for example like this:
regexp "(a*)([0-9a-fA-Z]*)(g|qwe)" aaaa987IUTqwe aaa 234234
This will output the TNFA, TDFA and D Code and try to match the 3 input
strings.
The file regextest.d is an example of how to use the generated code.

There is still a lot of room for optimization, but i think that
categorically, such a natively compiled, linear time regexp matcher
should be the fastest possible.

Comments and bug reports are encouraged ;)



More information about the Digitalmars-d mailing list