Tips on making regex more performant?
Dmitry Olshansky
dmitry.olsh at gmail.com
Tue Jun 18 13:54:00 PDT 2013
18-Jun-2013 22:53, Gary Willoughby wrote:
> Below is an example snippet of code to test the performance of regex
> matches. I need to parse a large log and extract data from it, and I've
> noticed a huge increase in loop time when reading and using regex.
>
> ...
> auto alert = regex(r"^Alert ([0-9]+)");
>
> while ((line = file.readln()) !is null)
> {
>     auto m = match(line, alert);
>
>     if (m)
>     {
>         alerts++;
>     }
>
>     counter++;
> }
> ...
>
Depending on your data size, consider reading more into memory in one go
and then using a regex with the "g" option on it. The difference is that
each time you call `match`, the regex engine has to allocate state for the
matcher. This amounts to a malloc/free per line, and millions of
allocations per second are simply out of reach.
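For instance, if the whole log fits comfortably in memory, a single global match over the entire contents allocates matcher state just once. A minimal sketch (untested; the file name "app.log" is a placeholder):

```d
import std.file : readText;
import std.regex;

auto alert = regex(r"^Alert ([0-9]+)", "gm"); // "m" makes ^ anchor at each line start
auto text = readText("app.log");              // hypothetical log file

size_t alerts = 0;
foreach (m; match(text, alert))               // one matcher allocation total
    ++alerts;
```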
I've long wondered what I can do to speed up such one-off matches, since
most of the tuning has gone into the matching engine itself, not the
start/stop paths. The proper fix would be to add a TLS cache of memory in
std.regex, so that each call to match reuses the same memory. That was
partially postponed until the supposedly soon-to-come Allocators design,
which never arrived.
Alternatives.
The simplest alternative would be to use an Appender!char object and fill
it up with, say, 100 lines before matching.
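A sketch of that batching idea (untested; the file name and the batch size of 100 are placeholders):

```d
import std.array : appender;
import std.regex;
import std.stdio;

auto alert = regex(r"^Alert ([0-9]+)", "gm");
auto buf = appender!(char[])();
size_t pending = 0, alerts = 0;

void flush()
{
    foreach (m; match(buf.data, alert))  // one matcher allocation per batch
        ++alerts;
    buf.clear();
    pending = 0;
}

foreach (line; File("app.log").byLine(KeepTerminator.yes))
{
    buf.put(line);
    if (++pending == 100)  // batch ~100 lines per match() call
        flush();
}
flush();  // don't forget the leftover lines
```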
The second option is to read a big buffer, find the last newline near its
end, and run a global match on that chunk. Then copy the (hopefully small)
tail to the beginning of the buffer and read into the rest.
The example below assumes a line can't be bigger than the buffer size.
(It's a sketch, untested.)
import std.algorithm : copy;
import std.regex;
import std.stdio;
import std.string : lastIndexOf;

auto alert = regex(..., "gm");
ubyte[] buffer = new ubyte[64*1024];
ubyte[] slice = buffer[];   // free space to read into
auto hasData = true;
while(hasData){
    size_t got = file.rawRead(slice).length;
    // fresh data + tail left over from the previous iteration
    auto usable = got + buffer.length - slice.length;
    if(got < slice.length)
        hasData = false;
    // locate the last complete line in the buffer
    auto idx = lastIndexOf(cast(char[])buffer[0..usable], '\n');
    if(idx < 0){
        // no complete line yet; keep the data and read more
        // (we assumed a line always fits in the buffer)
        slice = buffer[usable..$];
        continue;
    }
    size_t lastLF = idx + 1;  // one past the last newline
    char[] piece = cast(char[])buffer[0..lastLF];
    foreach(m; match(piece, alert))
    {
        //old code here
    }
    size_t tail = usable - lastLF;
    copy(buffer[lastLF..usable], buffer[0..tail]); // copy leftover to front
    slice = buffer[tail..$]; // read into the rest of the buffer
}
This is how I'd go about it ATM if speed is important.
> Using the above example I parse about 700K lines per second (I'm reading
> from an SSD). If I comment out the regex match function, I read at 4.5M
> lines per second. Considering I need to use about 8 regex matches and
> extract data, this figure further drops to about 100K lines per second.
You mean testing 8 separate patterns? That would multiply the
aforementioned malloc/free overhead by 8. One idea is to combine them all
into a single regex with alternations '|'.
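A sketch of that combination, dispatching on which capture group actually matched (the "Warning"/"Error" patterns are made-up stand-ins for your other seven):

```d
import std.regex;

// one pattern instead of several per-line match() calls;
// only one alternative can match, so test which group is non-empty
auto combined = regex(r"^Alert ([0-9]+)|^Warning ([0-9]+)|^Error ([0-9]+)");

auto m = match(line, combined);
if (m)
{
    if (m.captures[1].length)      { /* Alert: use m.captures[1] */ }
    else if (m.captures[2].length) { /* Warning */ }
    else                           { /* Error */ }
}
```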
> Is there anything I can do to speed up regex matching in such a
> scenario? Are there any tips you can share to speed things up?
>
There are also compiler flags to fiddle with (dmd):
-O -release -inline -noboundscheck
> Thanks.
--
Dmitry Olshansky