Let's stop parser Hell
Dmitry Olshansky
dmitry.olsh at gmail.com
Mon Aug 6 12:31:12 PDT 2012
On 06-Aug-12 22:52, Philippe Sigaud wrote:
>> I cut my teeth on perl, so everything gets compared to perl in my mind.
>>
>> In perl, I can 'precompile' a regular expression, so I can avoid some
>> overhead. So something like this:
>>
>> while(<STDIN>){
>>     if($_ =~ m/<someregex>/){
>>         some work;
>>     }
>> }
>>
>> Would end up re-compiling the regex on each line from STDIN. Perl uses the
>> term "compiling" the regular expression, which may be different than what
>> you call "setup regex engine".
Of course a regex is first compiled to bytecode (the same thing as
"compile" in Perl). Moreover, if you pass a pattern string directly, it
is compiled on first use and put into a TLS (thread-local) cache of
compiled patterns. From then on it's used in compiled form. (There are
only about 8 entries in the cache, so don't rely on it too much.)
What I mean by "setup engine" is the extra cost spent on _starting_ a
match with an already compiled pattern. For one thing, it adds one call
to malloc/free. Again, I think the same applies in Perl. In other words,
doing a continuous search (via foreach(x; match(..., "g" )) ) is much
faster than calling match on individual pieces over and over in a loop.
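To make the difference concrete, here is a minimal sketch, not a
benchmark. The pattern, the inputs and the helper names allHits and
hitsPerPiece are made up for the example; regex, match and the "g" flag
are the std.regex calls discussed above.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: one continuous search over the whole input.
// The "g" flag makes the match range walk every hit, so the engine
// is started once for the entire string.
string[] allHits(string text)
{
    auto re = regex(`[0-9]+`, "g");
    string[] hits;
    foreach (m; match(text, re))
        hits ~= m.hit;
    return hits;
}

// Hypothetical helper: the pattern is still compiled only once, but
// a fresh match is started for every piece, paying the per-match
// setup cost each time.
string[] hitsPerPiece(string[] pieces)
{
    auto re = regex(`[0-9]+`);
    string[] hits;
    foreach (piece; pieces)
    {
        auto m = match(piece, re);
        if (!m.empty)
            hits ~= m.front.hit;
    }
    return hits;
}

void main()
{
    writeln(allHits("a1 b22 c333"));
    writeln(hitsPerPiece(["a1", "b22", "c333"]));
}
```

Both functions find the same hits here; the first form is the one to
prefer when the input is available as one buffer. Newer Phobos releases
also provide matchAll, which does a global search without the "g" flag.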
>>
>> Does/Can D's std.regex offer something like this? If not, I would be
>> interested in why not.
>
Of course it does. In fact, you can't match without compiling the
pattern first (passing a pattern string is just a shortcut that compiles
it for you behind the scenes).
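A rough D counterpart of the Perl loop quoted above might look like the
sketch below. The pattern, the sample lines and the helper name
matchingLines are assumptions for illustration; the point is only that
regex() is called once, outside the loop.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: returns the lines that contain a match.
string[] matchingLines(string[] lines)
{
    // Compiled once, before the loop, and reused for every line.
    auto re = regex(`[0-9]+`);

    string[] result;
    foreach (line; lines)
    {
        if (!match(line, re).empty)
            result ~= line;
    }
    return result;
}

void main()
{
    writeln(matchingLines(["abc 123", "no digits", "x 45 y"]));
}
```

For real input, the array could be replaced by stdin.byLine() with the
same precompiled re.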
> D does have compiled regex, it's called ctRegex (meaning compile-time regex):
>
> http://dlang.org/phobos/std_regex.html#ctRegex
>
And there is a second version - compiled native code. Unlike Perl's
compiled form it's not bytecode, and thus it is usually much faster.
Frankly, the slowest regex I've seen is in Python; the second most
sucky one is PCRE (but it is incredibly popular somehow). Perl is not
bad, but usually slower than the top dogs from C++ & std.regex.
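As a small sketch of the native-code variant: the pattern and the helper
name hasNumber are made up for the example, and whether a given pattern
gets through CTFE can depend on the compiler and Phobos version.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: true if the text contains a run of digits.
bool hasNumber(string text)
{
    // ctRegex takes the pattern as a template argument, so the
    // matcher is generated at compile time instead of at run time.
    auto re = ctRegex!(`[0-9]+`);
    return !match(text, re).empty;
}

void main()
{
    writeln(hasNumber("abc 123"));
    writeln(hasNumber("no digits"));
}
```

The call site looks the same as with regex(); only the way the pattern
is supplied changes, which is why it has to be known at compile time.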
> The tests done recently put it among the fastest regex engine known.
Yeah, on top of said chart. Needless to say, the test favored my
implementation; ctRegex is not the fastest regex engine in general (yet).
> Caution: not every regex extension known to mankind is implemented
> here!
Sure as hell. In fact, the most problematic thing is that the pattern
parser often fails during CTFE.
Also, I have a solid plan for enhancing a bunch of things, effectively
making std.regex v2, but no sooner than October-November.
--
Dmitry Olshansky