Let's stop parser Hell

Dmitry Olshansky dmitry.olsh at gmail.com
Sat Jul 7 15:05:03 PDT 2012


On 08-Jul-12 01:49, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 21:35:52 UTC, Dmitry Olshansky wrote:
>> On 08-Jul-12 01:24, Roman D. Boiko wrote:
>>> But PEG itself doesn't require backtracking parsing, does it?
>>
>> It does. Specifically the algorithm is to try option A, try option B,
>> try option C...
> There is no algorithm, only __grammar__. It specifies that option A has
> higher priority than option B in the expression A / B / C. But it doesn't
> forbid, for example, analyzing all of them in parallel (tracking state
> transitions) for each character in turn, as a proper DFA
> implementation would do, until some option succeeds and all previous
> ones fail.

That would not always work, but yes, that's what we should do IMHO.
It's not what packrats do, however (when I say algorithm I mean
specifically packrat). And PEGs are commonly associated with packrats.
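
To make that concrete, here is a minimal sketch in D of how ordered
choice runs in a recursive-descent/packrat setting. The Result and
Parser types and the choice function are mine, purely illustrative:

struct Result { bool ok; size_t end; }
alias Parser = Result function(string input, size_t pos);

// Ordered choice A / B / C: try each option at the same position;
// the first success wins and later options are never consulted.
Result choice(string input, size_t pos, Parser[] options)
{
    foreach (p; options)
    {
        auto r = p(input, pos);  // every option restarts at the same pos
        if (r.ok)
            return r;            // commit to the first successful option
        // on failure nothing advances; re-reading from `pos` with the
        // next option is exactly the backtracking step
    }
    return Result(false, pos);
}

A packrat adds a memo table around each option call so that every
(rule, position) pair is parsed at most once, but the try-in-order
control flow stays the same.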

> A greedy regex would also have to check all successor options (C
> in the example above) to determine the longest one.
>
Yes, yet a regex engine (in its simple form) can 'merge' identical
parallel states* easily; it's not so simple with general CFGs.

*As in NFA states: I like to think in terms of multiple NFA states vs.
a single DFA state.
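
The merging I mean looks roughly like this, a Thompson-style
simulation sketch (the transition table delta is a made-up stand-in,
not real engine code):

// All live NFA states sit in a set, so two paths that reach the same
// state at the same input position collapse into a single entry.
// `delta` maps state -> char -> successor states.
bool simulate(string input, int[][char][int] delta,
              int start, bool[int] accepting)
{
    bool[int] current = [start: true];        // set of live NFA states
    foreach (char c; input)
    {
        bool[int] next;
        foreach (s, _; current)
            if (auto row = s in delta)
                if (auto succs = c in *row)
                    foreach (t; *succs)
                        next[t] = true;       // duplicate targets merge here
        current = next;
    }
    foreach (s, _; current)
        if (s in accepting)
            return true;
    return false;
}

Two paths that arrive at the same NFA state become indistinguishable,
which is what keeps the state set bounded by the automaton size.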

>> it's obvious how backtracking comes in ... when, say, A fails somewhere
>> in the middle of it.
> Simply stop tracking respective state. Process others in parallel as I
> described above.
>
Yeah, it could be done, but in a lot of cases it doesn't buy you a
thing, unlike with a regular grammar. The problem is that the state is
not as simple as in a regex, for instance (and even for a regex that
must capture some submatches, a DFA won't do, BTW). Thus you can't
merge seemingly identical states, because they depend on (different)
interpretations of the previous part of the input.
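
To put the same point in code, a made-up minimal model (Thread and
canMerge are illustrative, not from any real engine): two simulation
threads may sit in the same automaton state yet remember different
submatch boundaries, and collapsing them would silently drop one
interpretation.

struct Thread
{
    int state;        // automaton state: may coincide across threads
    size_t capStart;  // start of a captured group: may differ
    size_t capEnd;    // end of a captured group: may differ
}

// A capture-less NFA merges threads on `state` alone. Once threads
// carry submatch data, the merge key has to include it, so threads
// that look identical by position alone can no longer be collapsed.
bool canMerge(Thread a, Thread b)
{
    return a.state == b.state
        && a.capStart == b.capStart
        && a.capEnd == b.capEnd;
}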

That's why the DFA in ANTLR is only used for lookahead or lexing; it
doesn't maintain a path through states during parsing.

>> Of course there is a fair amount of bookkeeping that reduces doing
>> work all over again but it does ... allocate.
>
> The amount of memory used would be proportional to the length of the
> input after the branching point times the number of rules (actually,
> the number of states in the DFA, which is likely larger but still
> finite for a particular grammar).
>
Finite and proportional to the input size? Isn't that just another way
of stating the O(n) memory requirement of packrats?
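
For reference, the memoization that causes it, as a hypothetical
sketch (Packrat and apply are illustrative names): one cached result
per (rule, position) pair, so with R rules and input of length n the
table can grow to R * n entries.

struct Result { bool ok; size_t end; }  // same shape as the earlier sketch

struct Packrat
{
    Result[size_t][string] table;  // rule name -> position -> result

    Result apply(string rule, Result delegate(size_t) parse, size_t pos)
    {
        if (auto byPos = rule in table)
            if (auto hit = pos in *byPos)
                return *hit;           // reuse earlier work for free
        auto r = parse(pos);           // parse each (rule, pos) pair once...
        table[rule][pos] = r;          // ...and cache it; this is the
        return r;                      // bookkeeping that allocates
    }
}

And since entries are never evicted, the table only grows with the
input.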

>>>> Tokens.. there is no such term in use if we talk about 'pure' PEG.
>>>
>>> Terminal symbols.
>>>
>> Characters.
> Nope. According to the definition, a PEG is a set of non-terminal
> symbols, terminal symbols, production rules, and a starting
> non-terminal. Terminals are not necessarily characters.

Yup. But I'm not talking definitions here; I'm talking about the notion
of "integrated lexing", and lexing is certainly about characters... or
was it?


-- 
Dmitry Olshansky



