Writing a Parser - Walnut and aPaGeD comments

Sat Jan 12 03:34:43 PST 2008

Jascha Wetzel Wrote:

> Dan wrote:
> > Like MathML, it's way too far from the machine to generate an *efficient* parser.
> That depends on the grammar that is being parsed. Simple grammars can 
> often be parsed faster with hand-optimized parsers. The more complex the 
> grammar is, the less impact the generated parsers' overhead has.

Ah, but there's always overhead.

> Apaged was tailored for parsing D, and it's very fast for that. Last 
> time i checked, it parsed the complete Tango package in less than 2 
> seconds (including disk io).

Eep.  That would instantly make any JavaScript interpreter a failure; scripts need to parse *and run* in under 500ms (the unnoticeable range) to even be considered.

> it's a matter of what else is allowed in { }. besides, usually /* */ 
> comments are handled as whitespace lexemes, solving the problem before 
> parsing.

Aha!  Well then, the way I wrote my scanner/parser, a whole tree is built before parsing.  It's not fully functional yet, but I'm not seeing any design failures.
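
For reference, the comments-as-whitespace idea from the quoted text could look roughly like this in Python. It is only a sketch: the token classes, the `tokenize` name, and the punctuation set are made up for illustration and are not taken from Walnut or aPaGeD.

```python
import re

# Hypothetical token pattern. Whitespace and /* ... */ comments share one
# "skip" class, so they are dropped before the parser ever sees them.
TOKEN_RE = re.compile(r"""
    (?P<skip>\s+|/\*.*?\*/)      # whitespace or a block comment
  | (?P<ident>[A-Za-z_]\w*)      # identifier
  | (?P<number>\d+)              # integer literal
  | (?P<punct>[{}();=+\-*/])     # single-character punctuation
""", re.VERBOSE | re.DOTALL)

def tokenize(src):
    """Return (kind, text) pairs, with comments treated as whitespace."""
    pos, tokens = 0, []
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r} at {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Because the comment is consumed by the lexer, whatever is allowed inside `{ }` never has to account for it in the grammar.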

> 
> > Another classical problem is JavaScript RegExp literals or divide:
> > 
> > /bob/i  can be "divide bob divide i", or a regexp, depending on whether we expect an operator or operand.
> > 
> > How would you write that?
> > How would the machine read that?
> 
> the whole problem of parsing such a thing doesn't arise until embedded 
> into the grammar. it therefore depends on what else interferes with this 
> syntax. i don't know exactly what's allowed in JavaScript, but you can 
> probably distinguish these expressions by the leading / - in an 
> arithmetic expression that isn't allowed, therefore the parser tries to 
> match a regexp expression. Very simplified:
> Expr -> ReEx | ArEx
> ArEx -> ArEx '/' ArEx | Identifier | NumericLiteral
> ReEx -> '/' StringLiteral '/' OptParameters
> 
> Since neither Identifier nor NumericLiteral may start with '/' (i.e. '/' 
> is not in the first-set of ArEx), the grammar is unambiguous.

So you mean to say, it does it by looking at the tokens before and after and inferring the value based on whether an operator or operand is expected?

That makes sense.  I did it much the same way.
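
To make the operator-or-operand idea concrete, here is a minimal Python sketch of a scanner that decides what `/` means from what it expects next. The `scan` function, the token names, and the simplified regexp pattern are assumptions for illustration only. Real JavaScript has more cases (character classes containing `/`, keywords such as `return`, and so on), and this is not a description of how Walnut or aPaGeD work internally.

```python
import re

# Simplified regexp literal: '/', a body with optional backslash escapes,
# '/', then optional flags. Character classes like [/] are not handled.
REGEX_LITERAL = re.compile(r"/(?:\\.|[^/\\\n])+/[a-z]*")
OPERAND = re.compile(r"[A-Za-z_$][\w$]*|\d+")

def scan(src):
    """Tokenize src, reading '/' as a regexp start only where an operand may appear."""
    tokens = []
    pos = 0
    expect_operand = True  # an expression starts with an operand
    while pos < len(src):
        ch = src[pos]
        if ch.isspace():
            pos += 1
            continue
        if ch == "/" and expect_operand:
            # Division cannot start an operand, so this must be a regexp.
            m = REGEX_LITERAL.match(src, pos)
            if m is None:
                raise SyntaxError(f"bad regexp literal at {pos}")
            tokens.append(("regexp", m.group()))
            expect_operand = False
            pos = m.end()
            continue
        m = OPERAND.match(src, pos)
        if m:
            tokens.append(("operand", m.group()))
            expect_operand = False
            pos = m.end()
            continue
        if ch in "+-*/=(,":
            tokens.append(("operator", ch))
            expect_operand = True   # an operand must follow
        elif ch == ")":
            tokens.append(("operator", ch))
            expect_operand = False  # a closed group acts as an operand
        else:
            raise SyntaxError(f"unexpected character {ch!r} at {pos}")
        pos += 1
    return tokens
```

With this, `/bob/i` on its own scans as a single regexp token, while `a /bob/i` scans as `a` divided by `bob` divided by `i`, which matches the first-set argument in the quoted grammar.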