Writing a Parser - Walnut and aPaGeD comments

Thu Jan 10 01:17:08 PST 2008

Dan wrote:
>> Yeah, i became aware of that through the feedback. Without thinking too 
>> much, i assumed that using parsers would be something you'd do only if 
>> you dealt with parsers intimately. 
> 
> I took linguistics, and I'm an interpreter writer, and I still haven't looked at BNF or YACC/BISON notation yet.  To be honest, I'm not interested in it.  Like MathML, it's way too far from the machine to generate an *efficient* parser.  Mine might not be efficient, but that wouldn't be intrinsic to it being low-level.

That depends on the grammar that is being parsed. Simple grammars can 
often be parsed faster with hand-optimized parsers. The more complex the 
grammar is, the less impact the generated parsers' overhead has.
Apaged was tailored for parsing D, and it's very fast for that. Last 
time i checked, it parsed the complete Tango package in less than 2 
seconds (including disk io).

>> A Terminal is a symbol that is "final" wrt. expansion, while a 
>> Non-Terminal can be expanded by some rules. Terminals, Tokens and Lexemes are *practically* more or less the same. 
> 
> Linguistics assigns them very different meanings.

Formal languages do too, but for practical use it's not much of a 
difference.

>> If you think of parse trees when dealing with your grammar
> 
> Are those like sentence trees, with noun phrase, verb phrase, etc?

Probably. I don't know linguistics. Example:
S -> A
A -> aBa
B -> bCb
C -> c
C -> A

parsing ababcbaba yields
         S
         |
   _____ A _____
  /      |      \
a   ___ B____   a
    /    |    \
   b   _ C__   b
      /  |  \
     a   B   a
        /|\
       b C b
         |
         c

>>> - How to handle classic situations
> 
> When I first started with Walnut 0.x, I was grinding my brain trying to figure out how to match these correctly:
> 
> {  { /* } */ }  , { } }

it's a matter of what else is allowed in { }. besides, usually /* */ 
comments are handled as whitespace lexemes, solving the problem before 
parsing.

> Another classical problem is JavaScript RegExp literals or divide:
> 
> /bob/i  can be "divide bob divide i", or a regexp, depending on whether we expect an operator or operand.
> 
> How would you write that?
> How would the machine read that?

the whole problem of parsing such a thing doesn't arise until embedded 
into the grammar. it therefore depends on what else interferes with this 
syntax. i don't know exactly what's allowed in JavaScript, but you can 
probably distinguish these expressions by the leading / - in a 
arithmetic expression that isn't allowed, therefore the parser tries to 
match a regexp expression. Very simplified:
Expr -> ReEx | ArEx
ArEx -> ArEx '/' ArEx | Identifier | NumericLiteral
ReEx -> '/' StringLiteral '/' OptParameters

Since neither Identifier nor NumericLiteral may start with '/' (i.e. '/' 
is not in the first-set of ArEx), the grammar unambiguous.