Lexer and parser generators using CTFE
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Wed Feb 29 08:19:42 PST 2012
On 2/28/12 8:58 AM, Hisayuki Mima wrote:
> Certainly, I don't write yet the documentation of my library, ctpg.
> (But a few examples available here:
> https://github.com/youkei/ctpg/tree/master/example)
Nice! I have a few comments. I should say that they entail a fair amount
of work.
The current setup makes things very difficult for a common case -
generating a parse tree. The user would need to insert a lot of custom
code to create the appropriate nodes, but that's very easy for the
parser because it already has the components. The parser should have a
"build AST" option, in which case it should build all nodes without much
effort from the user. That's what ANTLR does, and it has a special
operator "^" to indicate where the root of the tree should be
(http://www.antlr2.org/doc/trees.html).
So here's what I think your examples should look like:
The syntax could be changed a bit - it should be less like D and more
like PEG. The "$" is implicit and shouldn't be needed. The ";" is a
useful terminator for large productions spanning more than one line, so
let's keep it. I don't understand the use of "!", for example the PEG
for expression parsing at
http://en.wikipedia.org/wiki/Parsing_expression_grammar is:
Value ← [0-9]+ / '(' Expr ')'
Product ← Value (('*' / '/') Value)*
Sum ← Product (('+' / '-') Product)*
Expr ← Sum
but your grammar has:
int primary = !"(" addExp !")" / intLit_p;
whereas it seems to me it should be
int primary = "(" addExp ")" / intLit_p;
No?
Here's how I think a parser that also builds an AST looks like:
mixin generateParsers!(
Options.createAST,
q{
root <- addExp;
addExp <- mulExp (("+" / "-")^ addExp)?
mulExp <- primary (("*" / "/")^ mulExp)?
primary <- "(" addExp ")"% / intLit_p;
}
);
I used the following notation: a node suffixed with "^" becomes the root
of the produced AST, and a node suffixed with "%" will be dropped
altogether (that's because ")" is not needed in the AST after the
expression has been parsed). With only three characters I informed the
parser what the shape of the AST will be.
At this point calling parse!root("1+2*3") would return an object of type
ASTNode!"addExp", which in turn has two children, lhs and rhs. The lhs
node has type ASTNode!"intLit_p" which has inside a value "1". The rhs
node has type ASTNode!"mulExp", and that in turn has two children nodes
lhs and rhs, both of type ASTNode!"intLit_p", each with its own value
("2" and "3", respectively). All these nodes are dynamically created by
the parsing process and inherit a common type, e.g. ASTNode!"" or some
type ASTRootNode.
> So, I'd like to describe it here.
>
> To be honest, ctpg is inspired by one module of Scala's standard
> library, Parser Combinators.
> One of the differences between Parser Combinators and ctpg is that
> Parser Combinators generates parsers in run-time, but ctpg generates
> parsers in compile-time by the power of CTFE and mixin.
>
> A parser generated by ctpg is a recursive descent parser, so it does
> lexical analysis and parsing at a time.
Oh, I forgot this property of PEGs - integrated lexing and parsing. So
no need for a separate lexer!
> And the type of input which it
> can accept is string, wstring, dstring and ForwardRange whose element
> type is char, wchar or dchar.
Great. I think we should actually define the notion of BufferedRange.
I'll get to that topic later.
> So, dividing lexical analysis and parsing into two processes is
> difficult in ctpg.
Yes, sorry I forgot about that!
> Importantly, a parser generated by ctpg is statically typed as one of
> the examples, "parsing simple arithmetic expression" shows.
> Generally a parser generated by other tool and accepting tokens returns
> the abstract syntax tree, but it return the evaluated value in the example.
> In other words, it does lexical analysis, parsing and (type) converting
> at a time.
> If you want simply abstract syntax tree, it may be a little pain to use
> ctpg.
>
> That's all I'd like to say about ctpg here.
Why would it be difficult to integrate AST creation with parsing? I'm
not sure I understand this. My understanding is that you should be able
to add a couple of simple operators ("^" and "%" as described above) to
inform AST creation, and you're done!
Thanks,
Andrei
More information about the Digitalmars-d
mailing list