[Issue 12690] std.regex BacktrackingMatcher bmatch is faster than ThompsonMatcher but discouraged

via Digitalmars-d-bugs digitalmars-d-bugs at puremagic.com
Sun May 4 03:20:21 PDT 2014


https://issues.dlang.org/show_bug.cgi?id=12690

--- Comment #2 from Diez <yxcvbasdfgqwert02 at gmx.de> ---
Created attachment 1351
  --> https://issues.dlang.org/attachment.cgi?id=1351&action=edit
Source code for more complex test case (UTF-8)

Thank you for your help. I verified the compilation options of my real-world
program and in fact the -inline option was missing (-O was already set).
However, it didn't change anything at the overall performance, it's still 5
seconds (backtracking) versus 47 seconds (Thompson). With this magnitude in
performance difference there's no chance to switch to the Thompson matcher.

The attached source is almost the same as before, but with average expression
and haystack from my real-world program (confidential informations scrambled).
Now it's not a small performance difference, but a factor of more than 5. 

As I understand the meaning of the word "discouraged", the backtracking matcher
is to be removed in some future. Please don't. As long as the Thompson matcher
has this poor performance (for certain use cases), there's need for both
matchers. In fact, using the Thompson matcher is way slower than an equivalent
Java 1.7 program, while using the backtracking matcher is slightly faster. Of
course, optimally there's only one matcher which combines the advantages of all
underlaying matchers.

In my real-world program, there's about 100 times more haystacks than patterns
(940 vs. 112_000). As soon as a pattern matches, the next haystack is examined.
Each haystack needed about 23 tries (avg) until a pattern matched, actually
there were 2_584_925 match calls in total of which 111_675 matched the haystack
(a "few" haystacks didn't match any pattern).

Note that the attached test case source file needs to be saved in UTF-8
encoding, since the regular expression string contains UTF-8 characters
(hopefully, they don't get destroyed during upload into the issue tracking
system). The UTF-8 signature comment line should help Windows editors to
automatically switch to UTF-8 encoding when opening the source file (like an
UTF-8 BOM).

The idea of matching multiple patterns with only one call of the matcher is
very interesting (though I didn't test it yet). Consider a 5 times faster
matcher with 5 matches at once ...

--


More information about the Digitalmars-d-bugs mailing list