Let's stop parser Hell

David Piepgrass qwertie256 at gmail.com
Sun Jul 8 16:37:19 PDT 2012


On Sunday, 8 July 2012 at 21:22:39 UTC, Roman D. Boiko wrote:
> On Sunday, 8 July 2012 at 21:03:41 UTC, Jonathan M Davis wrote:
>> It's been too long since I was actively working on parsers to 
>> give any details, but it is my understanding that because a 
>> hand-written parser is optimized for a specific grammar, it's 
>> going to be faster.
>
> My aim is to find out any potential bottlenecks and ensure that 
> those are possible to get rid of. So, let's try.
>
> I believe it would not hurt generality or quality of a parser 
> generator if it contained seams for inserting custom (optimized) 
> code where necessary, including those needed to take advantage 
> of some particular aspects of D grammar. Thus I claim that 
> optimization for D grammar is possible.

I'm convinced that the output of a parser generator (PG) can be 
very nearly as fast as hand-written code. ANTLR's output (last I 
checked) was not ideal, but the one I planned to make (a few 
years ago) would have produced faster code.

By default the PG's output will not match the speed of hand-written 
code, but the user can optimize it. Assuming an ANTLR-like PG, the 
user can inspect the original output looking for inefficient 
lookahead, or for cases where the parser checks rare cases before 
common ones, and then improve the grammar and insert ... I forget 
all the ANTLR terminology ... syntactic predicates or whatever, to 
optimize the parser.

> So far discussion goes in favor of LL(*) parser like ANTLR, 
> which is top-down recursive-descent. Either Pegged will be 
> optimized with LL(*) algorithms, or a new parser generator 
> created.

Right. For instance, I am interested in writing a top-down PG 
because I understand top-down parsers better and prefer their 
flexibility (semantic actions, allowing custom code) and 
understandability (the user can realistically understand the 
output; in fact, readability would be a specific goal of mine).

Roman, regarding what you were saying to me earlier:
>In stage 2 you have only performed some basic analysis, like, 
>e.g., matched braces to define some hierarchy. This means that 
>at the time when you find a missing brace, for example, you 
>cannot tell anything more than that braces don't match.

Stage 2 can actually tell more than just "a brace is missing 
somewhere", because so many languages are C-like. Given this 
situation:

    frob (c &% x)
       blip # gom;
    }

It doesn't need to know what language this is to tell where the 
brace belongs. Even in a more nebulous case like:

    frob (c &% x) bar @ lic
       blip # gom;
    }

probably the brace belongs at the end of the first line.
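The heuristic I have in mind can be sketched roughly like this (my 
own illustrative Python, not code from any real parser): given a 
stray closing brace, guess that the opening brace belongs at the end 
of the nearest preceding line whose indentation is at or below that 
of the closing brace.

```python
def guess_missing_open_brace(lines):
    """Guess which line is missing a '{', given one unmatched '}'.

    Heuristic sketch: walk forward counting brace depth; when a stray
    '}' is found, walk backward to the nearest non-blank line whose
    indentation is at or below the closing brace's indentation,
    assuming C-like "header line + indented body + '}'" layout.
    """
    depth = 0
    for i, line in enumerate(lines):
        depth += line.count("{") - line.count("}")
        if depth < 0:  # stray '}' on line i
            close_indent = len(line) - len(line.lstrip())
            for j in range(i - 1, -1, -1):
                text = lines[j]
                if not text.strip():
                    continue  # skip blank lines
                indent = len(text) - len(text.lstrip())
                if indent <= close_indent:
                    return j  # '{' probably belongs at end of line j
            return 0
    return None  # braces balanced; nothing to guess

src = [
    "frob (c &% x)",
    "   blip # gom;",
    "}",
]
print(guess_missing_open_brace(src))  # -> 0: brace belongs on the first line
```

The same heuristic lands on the first line in the more nebulous 
`bar @ lic` case too, since only indentation is consulted, not the 
tokens themselves.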

Perhaps your point is that there are situations where a parser 
that knows the "entire" grammar could make a better guess about 
where the missing brace/paren belongs. That's certainly true.

However, just because it can guess better doesn't mean it can 
reinterpret the code based on that guess. I mean, I don't see any 
way to "back up" a parser by an arbitrary amount. A hypothetical 
stage 2 would probably be hand-written and could realistically 
back up and insert a brace/paren anywhere that the heuristics 
dictate, because it is producing a simple data structure and it 
doesn't need to do any semantic actions as it parses. A "full" 
parser, on the other hand, has done a lot of work that it can't 
undo, so the best it can do is report to the user "line 54: 
error: brace mismatch; did you forget a brace on line 13?" The 
heuristic is still helpful, but it has already parsed lines 13 to 
54 in the wrong context (and, in some cases, has already spit 
out a series of error messages that are unrelated to the user's 
actual mistake).
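To make the contrast concrete, here is a toy sketch (again my own 
illustrative Python, with a hypothetical `repair_and_retry` helper): 
because a stage-2 pass only computes a simple structure, "backing 
up" can be as cheap as inserting the guessed brace and re-running 
the pass, whereas a full parser cannot cheaply discard the semantic 
work it has already done.

```python
def nesting_depths(lines):
    """Per-line brace depth, or None if a '}' appears with no match.
    Stands in for the simple data structure a stage-2 pass produces."""
    depths, depth = [], 0
    for line in lines:
        depth += line.count("{") - line.count("}")
        if depth < 0:
            return None  # unbalanced: stray '}'
        depths.append(depth)
    return depths

def repair_and_retry(lines, guess_line):
    """Cheap 'back up': append the guessed '{' and just re-run the
    whole pass. Affordable for a stage-2 pass; a full parser would
    have to unwind semantic actions it has already performed."""
    fixed = list(lines)
    fixed[guess_line] = fixed[guess_line] + " {"
    return nesting_depths(fixed)

src = ["frob (c &% x)", "   blip # gom;", "}"]
print(nesting_depths(src))       # -> None: unbalanced
print(repair_and_retry(src, 0))  # -> [1, 1, 0]: balanced after repair
```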

> As I demonstrated in some examples, it could get the output 
> which implies incorrect structure

I was unable to find the examples you refer to... this thread's 
getting a little unwieldy :)