ctRegex vs. Regex vs. plain string

Dmitry Olshansky dmitry.olsh at gmail.com
Thu Dec 6 07:59:58 PST 2012


12/6/2012 7:21 PM, Chris wrote:
> I have updated my code (finally!) to 2.060.

Congrats!

> As my project deals a lot
> with text processing including loads of special characters (á, ú etc.),
> I make extensive use of the std.regex module (and I really appreciate
> the use of the Thompson NFA). To optimize my program I have experimented
> with ctRegex / StaticRegex and Regex. However, there are still compile
> time problems with Regex and StaticRegex which is why I am using plain
> strings at the moment, which work fine with the same regular
> expressions.

At first I was confused by "make extensive use of the std.regex"  and 
"using plain strings". But then I recalled the problematic "bug" in how 
the compiler treats globals.

So if your code goes like this:

//globals or statics
auto re1 = regex(...);
auto re2 = regex(...);
//...
auto reK = regex(...);

//and e.g. in main:
void main(){
  ... use reX etc. ...
}

Then the long compilations are caused by the compiler constant-folding 
the initializers of the re1-reK variables. This forces it to parse and 
compile these patterns at compile time.

While it's cute and looks like a minor optimization, it can make compile 
times monstrous, especially as it just produces the same ordinary pattern 
that the run-time regex uses. The way out is to keep compiled patterns on 
the stack or to initialize them inside a static this() constructor.
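
A minimal sketch of both workarounds. The names wordRe, firstWord and 
firstNumber and the two patterns are made up for illustration, and I'm 
assuming the match function and the Regex!char type from std.regex:

```d
import std.regex;

// Module-level pattern declared WITHOUT an initializer, so there is
// nothing for the compiler to constant-fold. (Hypothetical name.)
Regex!char wordRe;

static this()
{
    // Compiled when the program starts, not while it is being compiled.
    wordRe = regex(`\w+`);
}

// Uses the pattern that static this() set up.
string firstWord(string s)
{
    auto m = match(s, wordRe);
    return m.empty ? null : m.hit;
}

// Alternative: keep the compiled pattern in a local (stack) variable.
string firstNumber(string s)
{
    auto numberRe = regex(`\d+`);
    auto m = match(s, numberRe);
    return m.empty ? null : m.hit;
}

void main()
{
    assert(firstWord("abc 123") == "abc");
    assert(firstNumber("abc 123") == "123");
}
```

Module-level variables are thread-local by default, and static this() 
runs once per thread, so each thread gets its own initialized copy.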

As for using strings as patterns: std.regex does compile them internally 
and caches the last 8 of them. In other words, scripts and programs that 
use only a few patterns should be fine with plain strings. It doesn't 
slow things down considerably, even in a tight loop.
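
For illustration, a small sketch of the plain-string style in a loop. 
The helper countDated and the date-like pattern are hypothetical; the 
only thing assumed from std.regex is the match overload that takes the 
pattern as a string:

```d
import std.regex;

// Hypothetical helper: counts the lines that contain a date-like token.
size_t countDated(string[] lines)
{
    size_t n;
    foreach (line; lines)
    {
        // The pattern is passed as a plain string on every iteration;
        // the library compiles it on first use and can reuse the
        // cached result afterwards.
        if (!match(line, `\d{4}-\d{2}-\d{2}`).empty)
            ++n;
    }
    return n;
}

void main()
{
    auto lines = ["released 2012-12-06", "no date here", "2012-08-02: 2.060"];
    assert(countDated(lines) == 2);
}
```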

But once you are going for about 10+ commonly used patterns then 
precompiling them is a better option.

> Are there any precautions I have to take when using compile
> time regular expressions?

One precaution is to use ctRegex only when things are well tested and 
you are ready to go for that extra speed. It typically takes a lot of 
time and RAM to get it to compile.

Then again, testing that the results do match is recommended. Simply 
because of the pressure it puts on the compiler, ctRegex is not that well 
tested (it goes through only a couple of tests in the Phobos unittests), 
unlike the regular one.
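
One way to do such a check is to run the same pattern through both 
engines and compare every match. This is only a sketch: the pattern and 
the helper enginesAgree are hypothetical, and you would feed it inputs 
taken from your own data:

```d
import std.algorithm : equal;
import std.regex;

enum pattern = `(\d+)-(\w+)`;   // hypothetical pattern, for illustration only

// Returns true if the run-time and compile-time engines report the
// same sequence of matches, including all capture groups.
bool enginesAgree(string input)
{
    auto rt = regex(pattern, "g");      // compiled at run time
    auto ct = ctRegex!(pattern, "g");   // compiled at compile time

    auto a = match(input, rt);
    auto b = match(input, ct);
    while (!a.empty && !b.empty)
    {
        if (!equal(a.captures, b.captures))
            return false;
        a.popFront();
        b.popFront();
    }
    // Both engines must also run out of matches at the same point.
    return a.empty && b.empty;
}

void main()
{
    assert(enginesAgree("12-ab, 345-cde and nothing else"));
    assert(enginesAgree("no match here"));
}
```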

> Does anyone have any experience as regards
> performance enhancement?
>

You tell me ;) As a matter of fact, I collect problematic or frequent 
patterns; I guess I need to advertise that somewhere.

Seriously, it depends on the patterns and the data. I'd expect it to be 
about 20-50% faster. But there are even cases where it may slow things 
down (the compile-time backend is not as sophisticated as the primary 
run-time one... something to improve with time).
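
Since it depends on the pattern and the data, the safest thing is to 
measure on your own input. A rough sketch, assuming a recent Phobos that 
ships std.datetime.stopwatch (older releases have a different timing 
API); the pattern, the sample text and countWith are made up:

```d
import std.array : replicate;
import std.datetime.stopwatch : benchmark;
import std.regex;
import std.stdio : writefln;

enum pattern = `\b\w+ing\b`;   // hypothetical pattern

// Counts all matches of re in text; works with either kind of regex.
size_t countWith(RegEx)(string text, RegEx re)
{
    size_t n;
    foreach (m; match(text, re))
        ++n;
    return n;
}

void main()
{
    // Hypothetical sample data; substitute your real text here.
    auto text = replicate("running and jumping while sitting still ", 10_000);
    auto rt = regex(pattern, "g");
    auto ct = ctRegex!(pattern, "g");

    size_t rtCount, ctCount;
    void runRt() { rtCount = countWith(text, rt); }
    void runCt() { ctCount = countWith(text, ct); }

    // Run each variant 10 times and report the total time per variant.
    auto times = benchmark!(runRt, runCt)(10);
    assert(rtCount == ctCount);
    writefln("run-time regex: %s, ctRegex: %s", times[0], times[1]);
}
```

The numbers only mean something for a release build and for data that 
resembles what the program actually processes.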

-- 
Dmitry Olshansky

