Writing a Parser - Walnut and aPaGeD comments

Sat Jan 12 04:55:25 PST 2008

Dan wrote:
>> That depends on the grammar that is being parsed. Simple grammars can 
>> often be parsed faster with hand-optimized parsers. The more complex the 
>> grammar is, the less impact the generated parsers' overhead has.
> 
> Ah, but there's always overhead.

What i meant was, that this overhead has a purpose. Once your grammar 
uses it, it's not overhead anymore.
Automatically generated parsers usually facilitate bottom-up parsing 
algorithms (LALR, GLR) that have several advantages over top-down 
algorithms (LL(k), usually implemented as recursive descent) that are 
used in manually written parsers.
LALR for example is completely iterative (i.e. no stack action, no 
calls, less code -> completely cached), making it more efficient than 
recursive descent. GLR is almost completely iterative, depending on the 
implementation.

If an automatically generated parser is slower, it is usually due to 
features that make it more flexible wrt. to grammars. Therefore, as 
stated initially, once you need this flexibility, the parser performs 
nicely.

Another thing is backtracking. RD can use lookahead, but it is more 
elaborate to implement it manually and generally, top-down lookahead 
isn't as effective as bottom-up lookahead. That is, it takes more 
complex grammars to create the need for backtracking in a bottom-up 
parser than in a top-down parser. That is where bottom-up parsing is 
categorically faster than top-down.

All this makes me claim that it'll be hard to write for example a 
recursive descent D parser by hand, that performs better than it's 
automatically generated GLR counterpart. You might still succeed by a 
small margin, but you will have spent a *lot* more time to do so. And 
you will have a parser that is not near as easily maintainable as the 
grammar used for the automatically generated parser.

>> Apaged was tailored for parsing D, and it's very fast for that. Last 
>> time i checked, it parsed the complete Tango package in less than 2 
>> seconds (including disk io).
> 
> Eep.  That would instantly make any JavaScript interpreter a failure; scripts need to *run* in the < 500ms (unnoticeable) range to even be considered.

The tango package has >400 files with multiple MB of code. I doubt there 
is any script or library written in JavaScript that is even close to 
that size. Parsing single files <1000 LoC is a matter of <1ms with 
apaged or any other LALR/GLR parser, no need to worry ;).

> So you mean to say, it does it by looking at the tokens before and after and inferring the value based on whether an operator or operand is expected?
> 
> That makes sense.  I did it similar to that.

In this case it even only needs to look at the next token. You're 
usually right with what you do to solve such a case if you don't need to 
backtrack. If so, you should double check that there is no other way.