Pegged, From EBNF to PEG

Dmitry Olshansky dmitry.olsh at gmail.com
Sat Mar 17 02:09:02 PDT 2012


On 17.03.2012 10:59, Philippe Sigaud wrote:
> On Wed, Mar 14, 2012 at 10:48, Dmitry Olshansky<dmitry.olsh at gmail.com>  wrote:
>
>>> That's one of the caveats on PEG. That and greedy operators.
>>>
>>> 'a*a' never succeeds because 'a*' consumes all the available a's.
>>>
>>
>> Hey, wait, I thought there has to be backtrack here, i.e. when second 'a'
>> finally fails?
>
> PEG only do local backtracking in 'or' choices, not in sequences. I
> think that's because the original author was dealing with packrat
> parsing and its O(input-size) guarantee.


Ok, let's agree on fact that semantically
a* is :
As <- a As / a

and a*? is this:
As <- a / a As

Now that's local ;)

>
> I remember trying compile-time backtracking in another similar library
> in D 1-2 years ago and getting lots of pb. I might add that in Pegged,
> but I don't know the consequences. How do you do that in std.regex?


There are two fundamental ways to get around the problem, the easiest
to implement is to use a stack to record a position in input + number of 
alternative/production (I call it a PC - program counter) every time 
there is a "fork" like that. Then when the current path ultimately fails 
pop stack, reset input and go over again until you either match or empty 
the stack. That's how to avoid deeeep recursion that happens here.

The problematic thing now is combinatorial explosion of
a* a* a* .... //simplified, but you get the idea

The trick to avoid this sort of crap is to
1) agree that whatever matches first has higher priority (left-most 
match) that's a simple disambiguation rule.
2) record what positions + pc you place on stack e.g. use a set, or in 
fact, a bitmap to push every unique combination  (pc,index) only once.

Voila! Now you have a linear worst-case algorithm with (M*N) complexity 
where M is number of all possible PCs, and N is number of all possible 
indexes in input.

Now if we put aside all crap talk about mathematical purity and monads, 
and you name it, a packrat is essentially this, a dynamic programming 
trick applied to recursive top-down parsing. The trick could be extended 
even more to avoid extra checks on input characters, but the essence is 
this "memoization".

>
> (nice article btw, I learnt some about regexes)

Thanks, I hope it makes them more popular.
Might as well keep me busy fixing bugs :)

-- 
Dmitry Olshansky


More information about the Digitalmars-d-announce mailing list