Writing a Parser - aPaGeD comments

Jascha Wetzel firstname at mainia.de
Wed Jan 9 01:50:17 PST 2008


Alan Knowles wrote:
> - It' works (yes, but you would be amazed, for the life of me I could 
> not get antlr to do anything - java write once, run nowhere...) - so 
> being able to generate code and working programs from the examples was 
> great, and a real +100 for the project..

I'm glad to hear that, although, i must admit, i was so bold to assume 
that already ;)

> - Documentation
> While I know it's a pain to write, the things you have already tend to 
> focus on how the parser is built, and are biased to someone 
> understanding the internals and phrase-ology involved in parsers, rather 
> than an end user - who just knows if I'm looking for this.. - then put 
> this, and the result is available in these variables:

Yeah, i became aware of that through the feedback. Without thinking too 
much, i assumed that using parsers would be something you'd do only if 
you dealt with parsers intimately. Therefore i felt it should be 
sufficient to describe everything that's specific to apaged.
Once i find the time, i'll extend the documentation with a thorough 
hands-on guide.

> Specifically I've no idea what the meanings of these are, and they are 
> rather critical to the docs....:
> "Terminal" "Non-Terminal"

A Terminal is a symbol that is "final" wrt. expansion, while a 
Non-Terminal can be expanded by some rules. Terminals, Tokens and 
Lexemes are *practically* more or less the same. In apaged, a Terminal 
is always a string or a regexp.
A Non-Terminal is specified by a set of production rules that tell how 
to match a sequence of Terminals and Non-Terminals. In apaged, a 
Non-Terminal is always something looking like this:

MyNT
{
   "a" MyOtherNT MyNT;
   "b" Nt2
   { /* D Code ... */ }
}

If you think of parse trees when dealing with your grammar (if not, 
ignore the rest), Non-Terminals are the inner nodes, and Terminals are 
the leaves.

> - Regex's
> While I can see the benefit's I'd much rather the compiler built them 
> for me.. - part of the beauty of the BNF format is that it's easy to 
> read, and explains regex style situations alot better.. - Otherwise (see 
> below about explaining how they can deal with classic situations...)

Do you mean regexp-like syntax for rules (i.e. EBNF syntax)?
This is a feature on my todo list, but i still have to find a nice way 
to merge this with the semantic code (which is non-trivial) - a thing 
for the future.

If you mean Regex's as in "asdf[a-z0-9]*", then i don't understand what 
you mean, since apaged does compile them for you. You don't need to use 
them at all, though. They're merely a convenience feature.

> - How to handle classic situations
> This is the key to success for the Documentation. (and what is seriously 
> missing) - as most people will probably have come from a lexx/yacc 
> background...
> 
> These are a few classic examples that the Documentation could do with.
> 
> * Top level parser starts.
> Most grammers start with a top level statement, eg.
> Program:
>     Statements;
> 
> In which case the application should only start by solving Statements, - 
> the biggest problem I found was that I had no idea how to stop it 
> matching any of the condition rules (that were only relivant to a 
> specific state - eg. see next example)

The docs state that
"The first non-terminal in a grammar file will be treated as the start 
symbol for the grammar."

I admit that it should be "...in the main grammar file..." since that 
doesn't apply for imported grammar files.

> * Parse a string
> This is a common pattern but it's quite difficult to see how to 
> implement it. -- And as above, when I tried, the parser started matching 
> DoubleQuotedStringChars at the start of the file (even though it's only 
> used in  DoubleQuotedString.
> 
> DoubleQuotedString;
>     QUOTE DoubleQuotedStringChars QUOTE
> 
> DoubleQuotedStringChars:
>     (DoubleQuotedStringChar)*
> 
> DoubleQuotedStringChar:
>     "\" ANYCHAR:
>     ^QUOTE;

Hm, it shouldn't do that. I'll assume that's because of how QUOTE is 
defined. I'd need to see the whole grammar.
Besides that, you use 2 unsupported EBNF features in the above grammar:
^QUOTE and
(DoubleQuotedStringChar)*
Apaged doesn't support full EBNF syntax for the reason mentioned above.

> * Classic groupings:
>     (.....)*    eg. many of these matches..
>     (.....)+    eg. one or more of these matches..
>     (.....)?    eg. one or none of these matches..
>     (.....)=> ...    if forward lookup succeeds on (...) try match next 
> combo.

None of those are supported. You need to write them BNF style:
A* becomes

As {
   As A;
   epsilon;
}

A+ becomes

Ap {
   Ap A;
   A;
}

A? becomes

Aq {
   A;
   epsilon;
}

Lookahead isn't supported for rules either. You may lookahead a single 
Terminal symbol, though, using >"a" or >["a" "b"], as documented.
Rule-level lookahead can also be rewritten BNF style, but it may affect 
more related rules and therefore depends on your instance of the problem.


I hope that helps a bit. If you read up on BNF vs. EBNF and consider 
that apaged is BNF based, you should find solutions to all of these 
problems.


More information about the Digitalmars-d-learn mailing list