[Issue 18600] New: Regex performance enhancement for repeated matchFirst calls
d-bugmail at puremagic.com
d-bugmail at puremagic.com
Mon Mar 12 18:20:13 UTC 2018
https://issues.dlang.org/show_bug.cgi?id=18600
Issue ID: 18600
Summary: Regex performance enhancement for repeated matchFirst
calls
Product: D
Version: D2
Hardware: x86
OS: Mac OS X
Status: NEW
Severity: enhancement
Priority: P1
Component: phobos
Assignee: nobody at puremagic.com
Reporter: jrdemail2000-dlang at yahoo.com
This is a follow-up to https://issues.dlang.org/show_bug.cgi?id=18114, which
identified a regex performance regression introduced in the 2.078 release. A
number of improvements in 2.079 cut the performance regression meaningfuly
enough to close the issue, but a meaningful delta remained, from 10-25% in my
tests, depending on the regex issued. This indicates headroom for improvement
exists.
Subsequently, Dmitry Olshansky identified another performance opportunity. The
test cases make repeated calls to matchFirst. This is the usage in eBay's TSV
Utilities where this was identified. In this usage, a regex match is made
against every line of input in turn.
Each call to matchFirst instantiates a new version of the regex engine, a
relatively expensive operation. Caching and reusing the engine on subsequent
calls would be faster. This optimization has been available for a while,
however, an increase in instantiation cost may have contributed to the
performance regression identified in 2.078.
In Dmitry's tests, caching the regex engine showed approximately 3x
improvement. This would be an improvement significantly beyond the regression
introduced in 2.078.
My performance tests from issue 18114 (see issue 18114 for test details):
| Regex | 2.077.1 | 2.078.0-beta1 | 2.079.0 |
|-----------------------+--------------+---------------+--------------|
| 'abc' | 3819.314 ms | 6202.892 ms | 5077.937 ms |
| '(aa)+(cc)+g' | 5457.678 ms | 8209.269 ms | 6672.057 ms |
| '(aa|ax).+[gxb][cyw]' | 10121.181 ms | 12448.443 ms | 11254.978 ms |
A material improvement over 2.078.0-beta1, but still 10-25% off 2.077.1,
depending on the regex.
Dimitry's tests caching the engine, using the second regex:
# Before
▶ ./find_regex '(aa)+(cc)+g' strings.txt
106437 matches; 5538.671 milliseconds
# After
▶ ./find_regex '(aa)+(cc)+g' strings.txt
106437 matches; 1880.550 milliseconds
This would be approximately a 3x improvement.
--
More information about the Digitalmars-d-bugs
mailing list