Tips on making regex more performant?

KJS kirkjobsluder at fastmail.fm
Tue Jun 18 16:26:09 PDT 2013


On Tuesday, 18 June 2013 at 18:53:34 UTC, Gary Willoughby wrote:
> Below is an example snippet of code to test for performance of 
> regex matches. I need to parse a large log and extract data 
> from it and i've noticed a huge increase in time of the loop 
> when reading and using regex.
>
> 	...
> 	auto alert = regex(r"^Alert ([0-9]+)");
>
> 	while ((line = file.readln()) !is null)
> 	{
> 		auto m = match(line, alert);
>
> 		if (m)
> 		{
> 			alerts++;
> 		}
>
> 		counter++;
> 	}
> 	...
>
> Using the above example i parse about 700K lines per second 
> (i'm reading from an SSD). If i comment out the regex match 
> function, i read at 4.5M lines per second. Considering i need 
> to use about 8 regex matches and extract data, this figure 
> further drops to about 100K lines per second.
>
> Is there anything i can do to speed up regex matching in such a 
> scenario? Are there any tips you can share to speed things up?
>
> Thanks.

I'm working with some string-heavy applications so I was curious 
about this myself. I'm new to D, but I did some heavy data 
analysis on chat files a while back.

Not knowing anything about your data or what other queries you 
might want to do on it, matching the first part of the string 
with std.algorithm.startsWith() and splitting the line on a 
delimiter outperforms regex matching on my admittedly arbitrary 
test code. I tested two extremes at 10,000,000 rounds:

Match everything: ~39 seconds for match, ~8 seconds for 
startsWith/split.
Match fails at start of string: ~10 seconds for match, ~1 second 
for startsWith/split.
Match fails at end of string: ~15 seconds for match, ~1 second 
for startsWith/split.

Even if you need a regex to match the middle, it might be 
worthwhile to filter the list with startsWith if you're matching 
a fixed string at the start of the line. Again, it depends on the 
frequency of hits and how the data is structured.

Split is probably not the best way to slice the match, but I 
don't have time tonight to try other slicing methods.




More information about the Digitalmars-d-learn mailing list