Tips on making regex more performant?
KJS
kirkjobsluder at fastmail.fm
Tue Jun 18 16:26:09 PDT 2013
On Tuesday, 18 June 2013 at 18:53:34 UTC, Gary Willoughby wrote:
> Below is an example snippet of code to test for performance of
> regex matches. I need to parse a large log and extract data
> from it and i've noticed a huge increase in time of the loop
> when reading and using regex.
>
> ...
> auto alert = regex(r"^Alert ([0-9]+)");
>
> while ((line = file.readln()) !is null)
> {
> auto m = match(line, alert);
>
> if (m)
> {
> alerts++;
> }
>
> counter++;
> }
> ...
>
> Using the above example i parse about 700K lines per second
> (i'm reading from an SSD). If i comment out the regex match
> function, i read at 4.5M lines per second. Considering i need
> to use about 8 regex matches and extract data, this figure
> further drops to about 100K lines per second.
>
> Is there anything i can do to speed up regex matching in such a
> scenario? Are there any tips you can share to speed things up?
>
> Thanks.
I'm working with some string-heavy applications so I was curious
about this myself. I'm new to D, but I did some heavy data
analysis on chat files a while back.
Not knowing anything about your data or what other queries you
might want to do on it, matching the first part of the string
with std.algorithm.startsWith() and splitting the line on a
delimiter outperforms regex matching on my admittedly arbitrary
test code. I tested two extremes at 10,000,000 rounds:
Match everything: ~39 seconds for match, ~8 seconds for
startsWith/split.
Match fails at start of string: ~10 seconds for match, ~1 second
for startsWith/split.
Match fails at end of string: ~15 seconds for match, ~1 second
for startsWith/split.
Even if you need a regex to match the middle, it might be
worthwhile to filter the list with startsWith if you're matching
a fixed string at the start of the line. Again, it depends on the
frequency of hits and how the data is structured.
Split is probably not the best way to slice the match, but I
don't have time tonight to try other slicing methods.
More information about the Digitalmars-d-learn
mailing list