[Issue 12690] std.regex BacktrackingMatcher bmatch is faster than ThompsonMatcher but discouraged

Sun May 4 04:47:08 PDT 2014

https://issues.dlang.org/show_bug.cgi?id=12690

--- Comment #3 from Dmitry Olshansky <dmitry.olsh at gmail.com> ---
(In reply to Diez from comment #2)
> Created attachment 1351 [details]
> Source code for more complex test case (UTF-8)
> 
> Thank you for your help. I verified the compilation options of my real-world
> program and in fact the -inline option was missing (-O was already set).
> However, it didn't change anything at the overall performance, it's still 5
> seconds (backtracking) versus 47 seconds (Thompson). With this magnitude in
> performance difference there's no chance to switch to the Thompson matcher.
> 
> The attached source is almost the same as before, but with average
> expression and haystack from my real-world program (confidential
> informations scrambled). Now it's not a small performance difference, but a
> factor of more than 5. 
> 

I see. The fact that -inline doesn't change a thing is troublesome but lets
leave that to compiler guys.

The new pattern makes things much clearer - as I suspected the preparation
stage of the Thompson engine is the one to blame. Preparing an internal
freelist alone takes ~1/3 of total run-time in this case. The figures are much
higher this time about x3 difference.

I have a hunch that it allocates much more then needed in this case, might be a
bug in counting up repetitions. Need to investigate further.

> As I understand the meaning of the word "discouraged", the backtracking
> matcher is to be removed in some future. Please don't. As long as the
> Thompson matcher has this poor performance (for certain use cases), there's
> need for both matchers. In fact, using the Thompson matcher is way slower
> than an equivalent Java 1.7 program, while using the backtracking matcher is
> slightly faster. Of course, optimally there's only one matcher which
> combines the advantages of all underlaying matchers.

Discouraged means exactly what you say - picking the type of underlying matcher
should be done internally based on pattern properties (or read that as one
meta-matcher). It occurred to me that 
- There are more types of matchers, some are very limited in what they can.
Thus should't push it on user to understand when he/she can use them.
- More matchers - more functions doesn't scale.
- Also replace functions have the same NxM combinatorics (n methods, m
functions), there is also splitter.

I envision that when std.regex turns package, it will gradually expose
internals for those who wants to make customize/their own matcher.

> In my real-world program, there's about 100 times more haystacks than
> patterns (940 vs. 112_000). As soon as a pattern matches, the next haystack
> is examined. Each haystack needed about 23 tries (avg) until a pattern
> matched, actually there were 2_584_925 match calls in total of which 111_675
> matched the haystack (a "few" haystacks didn't match any pattern).
> 
> Note that the attached test case source file needs to be saved in UTF-8
> encoding, since the regular expression string contains UTF-8 characters
> (hopefully, they don't get destroyed during upload into the issue tracking
> system). The UTF-8 signature comment line should help Windows editors to
> automatically switch to UTF-8 encoding when opening the source file (like an
> UTF-8 BOM).
> 
> The idea of matching multiple patterns with only one call of the matcher is
> very interesting (though I didn't test it yet). 

Well, it's not yet implemented ;) But I expect to get to it done somewhere in
May.

> Consider a 5 times faster
> matcher with 5 matches at once ...

--