Writing a Parser - Walnut and aPaGeD comments
Jascha Wetzel
firstname at mainia.de
Thu Jan 10 01:17:08 PST 2008
Dan wrote:
>> Yeah, i became aware of that through the feedback. Without thinking too
>> much, i assumed that using parsers would be something you'd do only if
>> you dealt with parsers intimately.
>
> I took linguistics, and I'm an interpreter writer, and I still haven't looked at BNF or YACC/BISON notation yet. To be honest, I'm not interested in it. Like MathML, it's way too far from the machine to generate an *efficient* parser. Mine might not be efficient, but that wouldn't be intrinsic to it being low-level.
That depends on the grammar that is being parsed. Simple grammars can
often be parsed faster with hand-optimized parsers. The more complex the
grammar is, the less impact the generated parsers' overhead has.
Apaged was tailored for parsing D, and it's very fast for that. Last
time i checked, it parsed the complete Tango package in less than 2
seconds (including disk io).
>> A Terminal is a symbol that is "final" wrt. expansion, while a
>> Non-Terminal can be expanded by some rules. Terminals, Tokens and Lexemes are *practically* more or less the same.
>
> Linguistics assigns them very different meanings.
Formal languages do too, but for practical use it's not much of a
difference.
>> If you think of parse trees when dealing with your grammar
>
> Are those like sentence trees, with noun phrase, verb phrase, etc?
Probably. I don't know linguistics. Example:
S -> A
A -> aBa
B -> bCb
C -> c
C -> A
parsing ababcbaba yields
S
|
_____ A _____
/ | \
a ___ B____ a
/ | \
b _ C__ b
/ | \
a B a
/|\
b C b
|
c
>>> - How to handle classic situations
>
> When I first started with Walnut 0.x, I was grinding my brain trying to figure out how to match these correctly:
>
> { { /* } */ } , { } }
it's a matter of what else is allowed in { }. besides, usually /* */
comments are handled as whitespace lexemes, solving the problem before
parsing.
> Another classical problem is JavaScript RegExp literals or divide:
>
> /bob/i can be "divide bob divide i", or a regexp, depending on whether we expect an operator or operand.
>
> How would you write that?
> How would the machine read that?
the whole problem of parsing such a thing doesn't arise until embedded
into the grammar. it therefore depends on what else interferes with this
syntax. i don't know exactly what's allowed in JavaScript, but you can
probably distinguish these expressions by the leading / - in a
arithmetic expression that isn't allowed, therefore the parser tries to
match a regexp expression. Very simplified:
Expr -> ReEx | ArEx
ArEx -> ArEx '/' ArEx | Identifier | NumericLiteral
ReEx -> '/' StringLiteral '/' OptParameters
Since neither Identifier nor NumericLiteral may start with '/' (i.e. '/'
is not in the first-set of ArEx), the grammar unambiguous.
More information about the Digitalmars-d-learn
mailing list