Let's stop parser Hell
David Piepgrass
qwertie256 at gmail.com
Sun Jul 8 16:37:19 PDT 2012
On Sunday, 8 July 2012 at 21:22:39 UTC, Roman D. Boiko wrote:
> On Sunday, 8 July 2012 at 21:03:41 UTC, Jonathan M Davis wrote:
>> It's been too long since I was actively working on parsers to
>> give any details, but it is my understanding that because a
>> hand-written parser is optimized for a specific grammar, it's
>> going to be faster.
>
> My aim is to find out any potential bottlenecks and ensure that
> those are possible to get rid of. So, let's try.
>
> I believe it would not hurt generality or quality of a parser
> generator if it contained sews for inserting custom (optimized)
> code where necessary, including those needed to take advantage
> of some particular aspects of D grammar. Thus I claim that
> optimization for D grammar is possible.
I'm convinced that the output of a parser generator (PG) can be
very nearly as fast as hand-written code. ANTLR's output (last I
checked) was not ideal, but the one I planned to make (a few
years ago) would have produced faster code.
By default the PG's output will not be the speed of hand-written
code, but the user can optimize it. Assuming an ANTLR-like PG,
the user can inspect the original output looking for inefficient
lookahead, or cases where the parser looks for rare cases before
common cases, and then improve the grammar and insert ... I
forget all the ANTLR terminology ... syntactic predicates or
whatever, to optimize the parser.
> So far discussion goes in favor of LL(*) parser like ANTLR,
> which is top-down recursive-descent. Either Pegged will be
> optimized with LL(*) algorithms, or a new parser generator
> created.
Right, for instance I am interested in writing a top-down PG
because I understand them better and prefer the top-down approach
due to its flexibility (semantic actions, allowing custom code)
and understandability (the user can realistically understand the
output; in fact readability would be a specific goal of mine)
Roman, regarding what you were saying to me earlier:
>In stage 2 you have only performed some basic analysis, like,
>e.g., matched braces to define some hierarchy. This means that
>at the time when you find a missing brace, for example, you
>cannot tell anything more than that braces don't match.
Stage 2 actually can tell more than just "a brace is missing
somewhere". Because so many languages are C-like. So given this
situation:
frob (c &% x)
blip # gom;
}
It doesn't need to know what language this is to tell where the
brace belongs. Even in a more nebulous case like:
frob (c &% x) bar @ lic
blip # gom;
}
probably the brace belongs at the end of the first line.
Perhaps your point is that there are situations where a parser
that knows the "entire" grammar could make a better guess about
where the missing brace/paren belongs. That's certainly true.
However, just because it can guess better, doesn't mean it can
reinterpret the code based on that guess. I mean, I don't see any
way to "back up" a parser by an arbitrary amount. A hypothetical
stage 2 would probably be hand-written and could realistically
back up and insert a brace/paren anywhere that the heuristics
dictate, because it is producing a simple data structure and it
doesn't need to do any semantic actions as it parses. A "full"
parser, on the other hand, has done a lot of work that it can't
undo, so the best it can do is report to the user "line 54:
error: brace mismatch; did you forget a brace on line 13?" The
heuristic is still helpful, but it has already parsed lines 13 to
54 in the wrong context (and, in some cases, has already split
out a series of error messages that are unrelated to the user's
actual mistake).
> As I demonstrated in some examples, it could get the output
> which implies incorrect structure
I was unable to find the examples you refer to... this thread's
getting a little unweildy :)
More information about the Digitalmars-d
mailing list