Writing a Parser - aPaGeD comments
Jascha Wetzel
firstname at mainia.de
Wed Jan 9 01:50:17 PST 2008
Alan Knowles wrote:
> - It' works (yes, but you would be amazed, for the life of me I could
> not get antlr to do anything - java write once, run nowhere...) - so
> being able to generate code and working programs from the examples was
> great, and a real +100 for the project..
I'm glad to hear that, although, i must admit, i was so bold to assume
that already ;)
> - Documentation
> While I know it's a pain to write, the things you have already tend to
> focus on how the parser is built, and are biased to someone
> understanding the internals and phrase-ology involved in parsers, rather
> than an end user - who just knows if I'm looking for this.. - then put
> this, and the result is available in these variables:
Yeah, i became aware of that through the feedback. Without thinking too
much, i assumed that using parsers would be something you'd do only if
you dealt with parsers intimately. Therefore i felt it should be
sufficient to describe everything that's specific to apaged.
Once i find the time, i'll extend the documentation with a thorough
hands-on guide.
> Specifically I've no idea what the meanings of these are, and they are
> rather critical to the docs....:
> "Terminal" "Non-Terminal"
A Terminal is a symbol that is "final" wrt. expansion, while a
Non-Terminal can be expanded by some rules. Terminals, Tokens and
Lexemes are *practically* more or less the same. In apaged, a Terminal
is always a string or a regexp.
A Non-Terminal is specified by a set of production rules that tell how
to match a sequence of Terminals and Non-Terminals. In apaged, a
Non-Terminal is always something looking like this:
MyNT
{
"a" MyOtherNT MyNT;
"b" Nt2
{ /* D Code ... */ }
}
If you think of parse trees when dealing with your grammar (if not,
ignore the rest), Non-Terminals are the inner nodes, and Terminals are
the leaves.
> - Regex's
> While I can see the benefit's I'd much rather the compiler built them
> for me.. - part of the beauty of the BNF format is that it's easy to
> read, and explains regex style situations alot better.. - Otherwise (see
> below about explaining how they can deal with classic situations...)
Do you mean regexp-like syntax for rules (i.e. EBNF syntax)?
This is a feature on my todo list, but i still have to find a nice way
to merge this with the semantic code (which is non-trivial) - a thing
for the future.
If you mean Regex's as in "asdf[a-z0-9]*", then i don't understand what
you mean, since apaged does compile them for you. You don't need to use
them at all, though. They're merely a convenience feature.
> - How to handle classic situations
> This is the key to success for the Documentation. (and what is seriously
> missing) - as most people will probably have come from a lexx/yacc
> background...
>
> These are a few classic examples that the Documentation could do with.
>
> * Top level parser starts.
> Most grammers start with a top level statement, eg.
> Program:
> Statements;
>
> In which case the application should only start by solving Statements, -
> the biggest problem I found was that I had no idea how to stop it
> matching any of the condition rules (that were only relivant to a
> specific state - eg. see next example)
The docs state that
"The first non-terminal in a grammar file will be treated as the start
symbol for the grammar."
I admit that it should be "...in the main grammar file..." since that
doesn't apply for imported grammar files.
> * Parse a string
> This is a common pattern but it's quite difficult to see how to
> implement it. -- And as above, when I tried, the parser started matching
> DoubleQuotedStringChars at the start of the file (even though it's only
> used in DoubleQuotedString.
>
> DoubleQuotedString;
> QUOTE DoubleQuotedStringChars QUOTE
>
> DoubleQuotedStringChars:
> (DoubleQuotedStringChar)*
>
> DoubleQuotedStringChar:
> "\" ANYCHAR:
> ^QUOTE;
Hm, it shouldn't do that. I'll assume that's because of how QUOTE is
defined. I'd need to see the whole grammar.
Besides that, you use 2 unsupported EBNF features in the above grammar:
^QUOTE and
(DoubleQuotedStringChar)*
Apaged doesn't support full EBNF syntax for the reason mentioned above.
> * Classic groupings:
> (.....)* eg. many of these matches..
> (.....)+ eg. one or more of these matches..
> (.....)? eg. one or none of these matches..
> (.....)=> ... if forward lookup succeeds on (...) try match next
> combo.
None of those are supported. You need to write them BNF style:
A* becomes
As {
As A;
epsilon;
}
A+ becomes
Ap {
Ap A;
A;
}
A? becomes
Aq {
A;
epsilon;
}
Lookahead isn't supported for rules either. You may lookahead a single
Terminal symbol, though, using >"a" or >["a" "b"], as documented.
Rule-level lookahead can also be rewritten BNF style, but it may affect
more related rules and therefore depends on your instance of the problem.
I hope that helps a bit. If you read up on BNF vs. EBNF and consider
that apaged is BNF based, you should find solutions to all of these
problems.
More information about the Digitalmars-d-learn
mailing list