Let's stop parser Hell

Mon Aug 6 12:59:24 PDT 2012

On 06-Aug-12 23:43, Philippe Sigaud wrote:
> On Mon, Aug 6, 2012 at 9:31 PM, Dmitry Olshansky <dmitry.olsh at gmail.com> wrote:
>
>> Of course regex is first compiled to bytecode (same thing as "compile" in
>> perl). Moreover if you use regex pattern directly it is compiled on first
>> use and put into TLS cache of compiled patterns. From now on it's used in
>> compiled form. (there are about 8 entries in cache, don't rely on it too much).
>
> Btw, I wanted to ask you that for a long time: what do you mean by
> 'compiled to bytecode', for D source?
>

I mean this:
auto r = regex(some_string); // where some_string can come from user input.
Here r contains bytecode that matches the pattern.

ctRegex is different: it outputs D source code, which is then compiled
to native code.
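
A minimal sketch of the two paths, for illustration only. The date
pattern and the helper names (matchesRuntime, matchesCompileTime) are
made up for this example and are not from std.regex; regex, ctRegex and
matchFirst are the actual library entry points.

```d
import std.regex;

// Hypothetical pattern, chosen only for illustration: an ISO-style date.
enum datePattern = `^\d{4}-\d{2}-\d{2}$`;

// Run-time path: the pattern string may come from anywhere (even user
// input); regex() compiles it to bytecode that the engine then interprets.
bool matchesRuntime(string input, string pattern = datePattern)
{
    auto r = regex(pattern);
    return !matchFirst(input, r).empty;
}

// Compile-time path: the pattern must be known at compile time;
// ctRegex generates D code for it, which ends up as native code.
bool matchesCompileTime(string input)
{
    auto r = ctRegex!datePattern;
    return !matchFirst(input, r).empty;
}
```

Both helpers should agree on any input, e.g. accept "2012-08-06" and
reject "Aug 6"; only the way the matcher is built differs.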
>> And there is a second version - compiled native code. Unlike perl it's not
>> bytecode and thus usually much faster.
>
> Which?

Compiled native code is faster. Or what did you mean?
>
>
>> Frankly the slowest regex I've seen is in Python, the second most sucky
>> one is PCRE (but is incredibly popular somehow). Perl is not bad but usually
>> slower than top dogs from C++ & std.regex.
>
> Do people *really* need speed, or a great number of extensions?
>

They want both. In my experience you can't satisfy everybody.
I think features are not what people look for in a regex engine: even
basic ECMA-262 stuff is way above what most programmers need. Extra
speed, on the other hand, is something even newbies could use :)
And while I'm at it - we already have a damn good collection of
extensions, way too many I'd say.

For example, I've yet to see a regex engine that supports Unicode to
the same level as std.regex. Namely, I haven't seen a single one with
full set operations (not only union but also subtraction, intersection
etc.) inside a character class [...]. Funnily enough, even the ICU one
didn't support them a year ago; I don't know about now.
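
A small sketch of what those set operations look like in a std.regex
pattern. The helper names are invented for this example, and the sets
are kept to ASCII ranges for brevity; it uses the run-time regex()
rather than ctRegex.

```d
import std.regex;

// Set subtraction with "--": lowercase ASCII letters minus the vowels.
bool isConsonantsOnly(string s)
{
    auto r = regex(`^[a-z--aeiou]+$`);
    return !matchFirst(s, r).empty;
}

// Set intersection with "&&": only letters that are in both a-m and h-z,
// i.e. h through m.
bool isInOverlap(string s)
{
    auto r = regex(`^[a-m&&h-z]+$`);
    return !matchFirst(s, r).empty;
}
```

With these, isConsonantsOnly accepts "rhythm" but not "regex", and
isInOverlap accepts "him" but not "ham".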

Also, some extensions exist only because of an implementation's inherent
inefficiency, e.g. (as in Perl) possessive quantifiers and atomic groups.
No wonder it's so hard to make look-behind unrestricted in Perl, the
whole thing is a mess.

>> Sure as hell. In fact, the most problematic thing is that parser often fails
>> during CTFE.
>
> For example?

Ehmn. See bugzilla, search for ctRegex.
Not yet filed are two more: the non-union set operations mentioned above
usually fail, and Unicode properties are not readable at compile time
(thus \p{xxx} fails).

>
>> Also I have a solid plan on enhancing a bunch of things effectively making
>> std.regex v2 but no sooner than October-November.
>
> That's good to know.
>

-- 
Dmitry Olshansky