Going on std.regex & std.uni bug-fixing hunt

Dmitry Olshansky via Digitalmars-d digitalmars-d at puremagic.com
Sun Sep 10 13:48:15 PDT 2017


On Sunday, 10 September 2017 at 18:54:21 UTC, Chad Joan wrote:
> On Sunday, 10 September 2017 at 11:47:02 UTC, Dmitry Olshansky 
> wrote:
>>
>> Yeah, well known problem. Solution is to keep a bit of memory 
>> cached eg  in TLS variable.
>>
>
> Indeed.
>
> Is there another issue I can mark it as a duplicate of?

No it's just somthing that showed up in a number of benchmarks 
people posted, never got around to it. So just file it so I'll 
have an issue to tick once I fix this.

>
>>>
>>> [...]
>>> -- The Captures struct does not specify what value is 
>>> returned for submatches that were in the branch of an 
>>> alternation that wasn't taken or in a repetition that matched 
>>> 0 or more than 1 times.
>>
>> As every engine out there the value is "", empty string.
>>
>
> I usually don't refer to other libraries while using a library.
>  If an API doesn't define something, then it is, by definition, 
> undefined behavior, and thus quite undesirable to rely upon.

In many way we just copy ECMAScript regex semantics, with some 
extensions from Python and Perl.

>
> This one seems pretty easy to fix though.  I will probably make 
> a documentation PR at some point.
>

Please do.
>>>
>>> -- The Captures struct does not seem to have a way to access 
>>> all of the strings matched by a submatch in repetition 
>>> context, not to mention nested repetition contexts.
>>>
>>
>> Just like any other regex library.
>>
>
> Counterexample: 
> https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.group.captures(v=vs.110).aspx#code-snippet-3
>

Horrible, horrible idea. Performance-wise that is.

> I actually have a strong interest in this.  And not because I 
> need to write regular expressions that extract lists of 
> patterns all the time (well, it might've happened).  More 
> importantly: this would make it easier to integrate Phobos' 
> regex engine into a parser generator framework.

No-no-no. Please don't.

> Current plans involve regular expression + parsing expression 
> grammars.  I'm pretty sure it is possible to mechanically 
> convert a subset of PEGs into Regexes and gain some useful 
> optimizations, but this requires granular control over regular 
> expression captures to be able to extract the text matched by 
> the original PEG symbols.

This is heavily misguided, but a common idea. PEGs are actually 
way simpler then regex. PEGs * and + have nothing to do with 
regex * and + qualifiers.

In PEG [ab]*b will never match because [ab]* eats any sequence of 
a-s and b-s and never backtracks.

In regex [ab]* will match b, ab, bb, .... because regex 
'backtracks'.

Thus one can easily see that PEGs constructs are trivial to 
implement, and provide quite a number of optimizations as well.

I'd argue that PEG can be made to run faster in 'parse' scenario 
whereas nothing beats _simple_ regex in 'search' scenario.

>> ThreadCache can go a long way to help that.
>>
>
> Google didn't help me with this one.  Any chance I could get a 
> link?
The manpage for jemalloc mentions it explicitly search for 
tcache. Also look through jemalloc and related papers, I think 
they all call it thread cache.

  https://linux.die.net/man/3/jemalloc

In essene you keep a small cache of free memory in TLS to serve 
next allocations faster. Some suggest a per-cpu cache, which is 
bit trickier but also interestingly avoid contention.

>>> things.  I hope your GC attempt works out!
>>
>> Me too. It's won't be trivial effort though.
>
> Good luck!

Thanks!




More information about the Digitalmars-d mailing list