Let's stop parser Hell
Dmitry Olshansky
dmitry.olsh at gmail.com
Mon Aug 6 12:59:24 PDT 2012
On 06-Aug-12 23:43, Philippe Sigaud wrote:
> On Mon, Aug 6, 2012 at 9:31 PM, Dmitry Olshansky <dmitry.olsh at gmail.com> wrote:
>
>> Of course regex is first compiled to bytecode (same thing as "compile" in
>> perl). Moreover if you use a regex pattern directly it is compiled on first
>> use and put into a TLS cache of compiled patterns. From then on it's used in
>> compiled form. (There are about 8 entries in the cache, don't rely on it too much.)
>
> Btw, I wanted to ask you that for a long time: what do you mean by
> 'compiled to bytecode', for D source?
>
By this:
auto r = regex(some_string); // where some_string can come from user input.
Here r contains bytecode that matches the pattern.
This is unlike ctRegex, which generates D source code that is then compiled
to native code.
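
A minimal sketch of the two flavours side by side, in case it helps. The helper names (matchesRuntime, matchesCompileTime) and the sample pattern are made up for illustration, and it assumes a current Phobos where matchFirst is available (older releases used match instead):

```d
import std.regex;
import std.stdio;

// Hypothetical helper: the pattern is only known at run time,
// e.g. it could come from user input.
bool matchesRuntime(string pattern, string input)
{
    // regex() parses the pattern at run time and produces a program
    // (bytecode) that the engine interprets while matching.
    auto r = regex(pattern);
    return !matchFirst(input, r).empty;
}

// Hypothetical helper: the pattern is a compile-time constant.
bool matchesCompileTime(string input)
{
    // ctRegex processes the pattern during compilation (via CTFE)
    // and emits D code that is compiled to native code.
    auto r = ctRegex!(`\d+-\d+`);
    return !matchFirst(input, r).empty;
}

void main()
{
    writeln(matchesRuntime(`\d+-\d+`, "call 555-1234"));
    writeln(matchesCompileTime("call 555-1234"));
}
```

Both objects are used the same way afterwards; the difference is only in when and how the pattern gets compiled.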
>> And there is a second version - compiled native code. Unlike perl it's not
>> bytecode and thus usually much faster.
>
> Which?
Compiled native code is faster. Or what do you mean?
>
>
>> Frankly the slowest regex I've seen is in Python, the second most sucky
>> one is PCRE (but is incredibly popular somehow). Perl is not bad but usually
>> slower than the top dogs from C++ & std.regex.
>
> Do people *really* need speed, or a great number of extensions?
>
They want both. In my experience you can't satisfy everybody.
I think features are not what people look for in regex; even basic
ECMA-262 stuff is way above what most programmers need. Extra speed,
on the other hand, is something even newbies could use :)
And while I'm at it - we already have a damn good collection of
extensions, way too many I'd say.
For example, I've yet to see one regex engine that supports Unicode to
the same level as std.regex; namely, I haven't seen a single one with
full set operations (not only union but subtraction, intersection etc.)
inside of a char class [...]. Funny, even the ICU one didn't support them
a year ago, dunno about now.
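
To make the set operations concrete, here is a small sketch using the operator spellings std.regex accepts inside a char class (-- for subtraction, && for intersection). The firstHit helper and the sample strings are made up for illustration, and matchFirst assumes a current Phobos:

```d
import std.regex;
import std.stdio;

// Hypothetical helper: returns the first match of `pattern` in `input`,
// or an empty string when nothing matches.
string firstHit(string pattern, string input)
{
    auto m = matchFirst(input, regex(pattern));
    return m.empty ? "" : m.hit;
}

void main()
{
    // Subtraction: lowercase ASCII letters minus the vowels.
    writeln(firstHit(`[a-z--aeiou]+`, "aeiou xyz"));

    // Intersection: only characters present in both operands,
    // here the letters a-f.
    writeln(firstHit(`[a-z&&a-f0-9]+`, "xyz 09 cafe"));
}
```

These patterns are compiled with regex() at run time, not with ctRegex.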
Also, some extensions stem from an implementation's inherent inefficiency,
e.g. (as in Perl) possessive quantifiers and atomic groups. No wonder it's
so hard to make look-behind unrestricted in Perl, the whole thing is a mess.
>> Sure as hell. In fact, the most problematic thing is that parser often fails
>> during CTFE.
>
> For example?
Ehmn. See bugzilla, search for ctRegex.
Not yet filed: the non-union set operations mentioned above usually fail,
and Unicode properties are not readable at CT (thus \p{xxx} fails).
>
>> Also I have a solid plan on enhancing a bunch of things effectively making
>> std.regex v2 but no sooner than October-November.
>
> That's good to know.
>
--
Dmitry Olshansky