Tips on making regex more performant?

Dmitry Olshansky dmitry.olsh at gmail.com
Tue Jun 18 13:54:00 PDT 2013


18-Jun-2013 22:53, Gary Willoughby wrote:
> Below is an example snippet of code to test for performance of regex
> matches. I need to parse a large log and extract data from it and I've
> noticed a huge increase in time of the loop when reading and using regex.
>
>      ...
>      auto alert = regex(r"^Alert ([0-9]+)");
>
>      while ((line = file.readln()) !is null)
>      {
>          auto m = match(line, alert);
>
>          if (m)
>          {
>              alerts++;
>          }
>
>          counter++;
>      }
>      ...
>

Depending on your data size, try reading more into memory in one go and 
then using a regex with the "g" option on that chunk. The point is that 
each time you call `match`, the regex engine has to allocate state for 
the matcher. This leads to basically one malloc/free pair per line, and 
millions of allocations per second are surely out of reach.

I've long been wondering what I can do to speed up such one-off matches, 
as most of the tuning went into the matching engine itself, not the 
start/stop paths. What would fix this is adding a TLS cache of memory to 
std.regex, so that it reuses the same memory on each call to match. This 
was partially postponed until the supposedly soon-to-come Allocators 
design, which never arrived.
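
To illustrate the idea (not std.regex internals, just a minimal sketch 
of the caching pattern): in D, module-level and static variables are 
thread-local by default, so a scratch buffer declared that way is 
reused across calls within a thread instead of being reallocated:

static ubyte[] scratch; // thread-local by default in D

ubyte[] getScratch(size_t n)
{
	if(scratch.length < n)
		scratch.length = n; // grow once, then reuse on later calls
	return scratch[0..n];
}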

Alternatives.

The simplest alternative would be to use an Appender!char object, fill 
it up with, say, 100 lines, and run one global match over the 
accumulated chunk (see the sketch below).
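
A minimal sketch of that, untested; the file name and the way alerts 
are counted are placeholders for your code:

import std.array, std.regex, std.stdio;

void main()
{
	auto file = File("app.log"); // hypothetical log file
	// "m" so that ^ matches at the start of every line in the chunk
	auto alert = regex(r"^Alert ([0-9]+)", "gm");
	auto buf = appender!(char[])();
	size_t pending, alerts;
	foreach(line; file.byLine(KeepTerminator.yes))
	{
		buf.put(line);
		if(++pending < 100)
			continue;
		foreach(m; match(buf.data, alert))
			alerts++;
		buf.clear(); // keeps the allocated memory, drops contents
		pending = 0;
	}
	foreach(m; match(buf.data, alert)) // leftover lines
		alerts++;
	writeln(alerts);
}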

The second option is to read into a big buffer, find the last newline in 
it, and run a global match on the chunk up to that point. Then copy the 
(hopefully small) tail to the beginning of the buffer and read into the 
rest.

The example below assumes a single line can't be bigger than the buffer size.

(it's a sketch, untested; `file` is the File from your snippet)

import std.algorithm, std.range, std.regex;

auto alert = regex(r"^Alert ([0-9]+)", "gm");
ubyte[] buffer = new ubyte[64*1024];
ubyte[] slice = buffer[]; // free part of the buffer to read into
auto hasData = true;
while(hasData){
	size_t got = file.rawRead(slice).length;
	// valid data = what we just read + tail from prev iteration
	auto usable = got + buffer.length - slice.length;
	if(got < slice.length)
		hasData = false;
	size_t lastLF;
	if(!hasData)
		lastLF = usable; // last chunk: process everything that's left
	else{
		// find the last newline in the valid data
		auto idx = countUntil(retro(buffer[0..usable]), '\n');
		if(idx < 0){ // full buffer, no newline: line exceeds buffer size
			slice = buffer[];
			continue;
		}
		lastLF = usable - idx; // one past the last '\n'
	}
	char[] piece = cast(char[])buffer[0..lastLF];
	foreach(m; match(piece, alert))
	{
		//old code here
	}
	size_t tail = usable - lastLF;
	copy(buffer[lastLF..usable], buffer[0..tail]); //copy leftover to front
	slice = buffer[tail..$]; //read into the rest of the buffer
}

This is how I'd go about it ATM if speed is important.

> Using the above example I parse about 700K lines per second (I'm reading
> from an SSD). If I comment out the regex match function, I read at 4.5M
> lines per second. Considering I need to use about 8 regex matches and
> extract data, this figure further drops to about 100K lines per second.

You mean testing 8 separate patterns? That would multiply the 
aforementioned malloc/free traffic by 8. One idea is to put them all 
into a single regex with alternations '|' (see the sketch below).
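
A minimal, untested sketch of that folding; the Warning pattern is 
invented for illustration. You can tell which alternative hit by 
checking which capture group is non-empty:

import std.regex, std.stdio;

void main()
{
	auto combined = regex(r"^(?:Alert ([0-9]+)|Warning ([0-9]+))", "gm");
	auto text = "Alert 42\nWarning 7\nAlert 13\n";
	size_t alerts, warnings;
	foreach(m; match(text, combined))
	{
		if(m[1].length) // group 1 matched => it was an Alert
			alerts++;
		else
			warnings++;
	}
	writeln(alerts, " ", warnings); // prints: 2 1
}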


> Is there anything I can do to speed up regex matching in such a
> scenario? Are there any tips you can share to speed things up?
>

There are also compiler flags to fiddle with (dmd):
  -O -release -inline -noboundscheck
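
For example (source file name made up):

	dmd -O -release -inline -noboundscheck parselog.d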

> Thanks.


-- 
Dmitry Olshansky

