"Regular Expression Matching Can Be Simple And Fast (but...) "

PeteC petevik38 at yahoo.com.au
Sun Feb 27 04:40:38 PST 2011


A few months ago I was using D on a project that involved filtering and
tranforming file paths. Using D helped a lot, but I was a bit frustrated
to discover there were a number of bugs in the standard regex library. I
was able to fix a couple, which I reported in bugzilla, but I couldn't
understand the source code well enough to deal with all of them.

After reading Russ Cox's articles, I implemented an experimental regular
expression engine based on the operations and execution methods
mentioned in the http://swtch.com/~rsc/regexp/regexp2.html article.

It's not yet completed, but I've put it up on github if anyone is
interested in looking at it:
https://github.com/PeteChadwick/pcregexeng

Bear in mind that at this stage it's just an experiment and isn't
intented to be used as a regex library. Comments, suggestions and
criticisms are welcome.

The library includes a Thompson style machine, and a very basic
recursive backtracking machine. I tried a few simple benchmarks of these
machines, alongside std.regex.match. The results are below:

Regex test

Regex: '([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})'
Iterations: 100000
Text length: 20
Text: User at domain.name.com
lockstep  (Email): 2699 ticks  1415.05 ticks/MB (User at domain.name.com)
backtrack (Email): 109 ticks 57.1474 ticks/MB (User at domain.name.com)
std.regex (Email): 250 ticks 131.072 ticks/MB

Regex: '([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})'
Iterations: 10
Text length: 210020
Text: not-an-email-address not-an-email-address not-an-e...
lockstep  (Email prefix 1): 2074 ticks  1035.5 ticks/MB (User at domain.name.com)
backtrack (Email prefix 1): 468 ticks 233.66 ticks/MB (User at domain.name.com)
std.regex (Email prefix 1): 1342 ticks 670.026 ticks/MB

Regex: '([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})'
Iterations: 10
Text length: 1048596
Text: ||||||||||||||||||||||||||||||||||||||||||||||||||...
lockstep  (Email prefix 2): 5944 ticks  594.389 ticks/MB (User at domain.name.com)
backtrack (Email prefix 2): 717 ticks 71.6986 ticks/MB (User at domain.name.com)
std.regex (Email prefix 2): 468 ticks 46.7991 ticks/MB

Regex: 'a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaa'
Iterations: 1
Text length: 18
Text: aaaaaaaaaaaaaaaaaa
lockstep  (Pathalogical): 0 ticks  0 ticks/MB (aaaaaaaaaaaaaaaaaa)
backtrack (Pathalogical): 47 ticks 2.73795e+06 ticks/MB (aaaaaaaaaaaaaaaaaa)
std.regex (Pathalogical): 14586 ticks 8.49696e+08 ticks/MB

It terms of ticks/MB the Thompson machine implementation is quite a lot
slower in these tests, but shows more consistency, and is able to deal
with the pathalogical case.


More information about the Digitalmars-d mailing list