[Issue 18600] New: Regex performance enhancement for repeated matchFirst calls

d-bugmail at puremagic.com d-bugmail at puremagic.com
Mon Mar 12 18:20:13 UTC 2018


https://issues.dlang.org/show_bug.cgi?id=18600

          Issue ID: 18600
           Summary: Regex performance enhancement for repeated matchFirst
                    calls
           Product: D
           Version: D2
          Hardware: x86
                OS: Mac OS X
            Status: NEW
          Severity: enhancement
          Priority: P1
         Component: phobos
          Assignee: nobody at puremagic.com
          Reporter: jrdemail2000-dlang at yahoo.com

This is a follow-up to https://issues.dlang.org/show_bug.cgi?id=18114, which
identified a regex performance regression introduced in the 2.078 release. A
number of improvements in 2.079 cut the performance regression meaningfuly
enough to close the issue, but a meaningful delta remained, from 10-25% in my
tests, depending on the regex issued. This indicates headroom for improvement
exists.

Subsequently, Dmitry Olshansky identified another performance opportunity. The
test cases make repeated calls to matchFirst. This is the usage in eBay's TSV
Utilities where this was identified. In this usage, a regex match is made
against every line of input in turn.

Each call to matchFirst instantiates a new version of the regex engine, a
relatively expensive operation. Caching and reusing the engine on subsequent
calls would be faster. This optimization has been available for a while,
however, an increase in instantiation cost may have contributed to the
performance regression identified in 2.078.

In Dmitry's tests, caching the regex engine showed approximately 3x
improvement. This would be an improvement significantly beyond the regression
introduced in 2.078.

My performance tests from issue 18114 (see issue 18114 for test details):

| Regex                 | 2.077.1      | 2.078.0-beta1 | 2.079.0      |
|-----------------------+--------------+---------------+--------------|
| 'abc'                 |  3819.314 ms |  6202.892 ms  |  5077.937 ms |
| '(aa)+(cc)+g'         |  5457.678 ms |  8209.269 ms  |  6672.057 ms |
| '(aa|ax).+[gxb][cyw]' | 10121.181 ms | 12448.443 ms  | 11254.978 ms |

A material improvement over 2.078.0-beta1, but still 10-25% off 2.077.1,
depending on the regex.

Dimitry's tests caching the engine, using the second regex:

# Before
▶ ./find_regex '(aa)+(cc)+g' strings.txt               
106437 matches; 5538.671 milliseconds

# After
▶ ./find_regex '(aa)+(cc)+g' strings.txt
106437 matches; 1880.550 milliseconds

This would be approximately a 3x improvement.

--


More information about the Digitalmars-d-bugs mailing list