Going on std.regex & std.uni bug-fixing hunt
Dmitry Olshansky via Digitalmars-d
digitalmars-d at puremagic.com
Sun Sep 10 13:48:15 PDT 2017
On Sunday, 10 September 2017 at 18:54:21 UTC, Chad Joan wrote:
> On Sunday, 10 September 2017 at 11:47:02 UTC, Dmitry Olshansky
> wrote:
>>
>> Yeah, well known problem. Solution is to keep a bit of memory
>> cached eg in TLS variable.
>>
>
> Indeed.
>
> Is there another issue I can mark it as a duplicate of?
No it's just somthing that showed up in a number of benchmarks
people posted, never got around to it. So just file it so I'll
have an issue to tick once I fix this.
>
>>>
>>> [...]
>>> -- The Captures struct does not specify what value is
>>> returned for submatches that were in the branch of an
>>> alternation that wasn't taken or in a repetition that matched
>>> 0 or more than 1 times.
>>
>> As every engine out there the value is "", empty string.
>>
>
> I usually don't refer to other libraries while using a library.
> If an API doesn't define something, then it is, by definition,
> undefined behavior, and thus quite undesirable to rely upon.
In many way we just copy ECMAScript regex semantics, with some
extensions from Python and Perl.
>
> This one seems pretty easy to fix though. I will probably make
> a documentation PR at some point.
>
Please do.
>>>
>>> -- The Captures struct does not seem to have a way to access
>>> all of the strings matched by a submatch in repetition
>>> context, not to mention nested repetition contexts.
>>>
>>
>> Just like any other regex library.
>>
>
> Counterexample:
> https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.group.captures(v=vs.110).aspx#code-snippet-3
>
Horrible, horrible idea. Performance-wise that is.
> I actually have a strong interest in this. And not because I
> need to write regular expressions that extract lists of
> patterns all the time (well, it might've happened). More
> importantly: this would make it easier to integrate Phobos'
> regex engine into a parser generator framework.
No-no-no. Please don't.
> Current plans involve regular expression + parsing expression
> grammars. I'm pretty sure it is possible to mechanically
> convert a subset of PEGs into Regexes and gain some useful
> optimizations, but this requires granular control over regular
> expression captures to be able to extract the text matched by
> the original PEG symbols.
This is heavily misguided, but a common idea. PEGs are actually
way simpler then regex. PEGs * and + have nothing to do with
regex * and + qualifiers.
In PEG [ab]*b will never match because [ab]* eats any sequence of
a-s and b-s and never backtracks.
In regex [ab]* will match b, ab, bb, .... because regex
'backtracks'.
Thus one can easily see that PEGs constructs are trivial to
implement, and provide quite a number of optimizations as well.
I'd argue that PEG can be made to run faster in 'parse' scenario
whereas nothing beats _simple_ regex in 'search' scenario.
>> ThreadCache can go a long way to help that.
>>
>
> Google didn't help me with this one. Any chance I could get a
> link?
The manpage for jemalloc mentions it explicitly search for
tcache. Also look through jemalloc and related papers, I think
they all call it thread cache.
https://linux.die.net/man/3/jemalloc
In essene you keep a small cache of free memory in TLS to serve
next allocations faster. Some suggest a per-cpu cache, which is
bit trickier but also interestingly avoid contention.
>>> things. I hope your GC attempt works out!
>>
>> Me too. It's won't be trivial effort though.
>
> Good luck!
Thanks!
More information about the Digitalmars-d
mailing list