Lexer and parser generators using CTFE

Wed Feb 29 08:19:42 PST 2012

On 2/28/12 8:58 AM, Hisayuki Mima wrote:
> Certainly, I don't write yet the documentation of my library, ctpg.
> (But a few examples available here:
> https://github.com/youkei/ctpg/tree/master/example)

Nice! I have a few comments. I should say that they entail a fair amount 
of work.

The current setup makes things very difficult for a common case - 
generating a parse tree. The user would need to insert a lot of custom 
code to create the appropriate nodes, but that's very easy for the 
parser because it already has the components. The parser should have a 
"build AST" option, in which case it should build all nodes without much 
effort from the user. That's what ANTLR does, and it has a special 
operator "^" to indicate where the root of the tree should be 
(http://www.antlr2.org/doc/trees.html).

So here's what I think your examples should look like:

The syntax could be changed a bit - it should be less like D and more 
like PEG. The "$" is implicit and shouldn't be needed. The ";" is a 
useful terminator for large productions spanning more than one line, so 
let's keep it. I don't understand the use of "!", for example the PEG 
for expression parsing at 
http://en.wikipedia.org/wiki/Parsing_expression_grammar is:

Value   ← [0-9]+ / '(' Expr ')'
Product ← Value (('*' / '/') Value)*
Sum     ← Product (('+' / '-') Product)*
Expr    ← Sum

but your grammar has:

int primary = !"(" addExp !")" / intLit_p;

whereas it seems to me it should be

int primary = "(" addExp ")" / intLit_p;

No?

Here's how I think a parser that also builds an AST looks like:

mixin generateParsers!(
   Options.createAST,
   q{
     root    <- addExp;
     addExp  <- mulExp (("+" / "-")^ addExp)?
     mulExp  <- primary (("*" / "/")^ mulExp)?
     primary <- "(" addExp ")"% / intLit_p;
   }
);

I used the following notation: a node suffixed with "^" becomes the root 
of the produced AST, and a node suffixed with "%" will be dropped 
altogether (that's because ")" is not needed in the AST after the 
expression has been parsed). With only three characters I informed the 
parser what the shape of the AST will be.

At this point calling parse!root("1+2*3") would return an object of type 
ASTNode!"addExp", which in turn has two children, lhs and rhs. The lhs 
node has type ASTNode!"intLit_p" which has inside a value "1". The rhs 
node has type ASTNode!"mulExp", and that in turn has two children nodes 
lhs and rhs, both of type ASTNode!"intLit_p", each with its own value 
("2" and "3", respectively). All these nodes are dynamically created by 
the parsing process and inherit a common type, e.g. ASTNode!"" or some 
type ASTRootNode.

> So, I'd like to describe it here.
>
> To be honest, ctpg is inspired by one module of Scala's standard
> library, Parser Combinators.
> One of the differences between Parser Combinators and ctpg is that
> Parser Combinators generates parsers in run-time, but ctpg generates
> parsers in compile-time by the power of CTFE and mixin.
>
> A parser generated by ctpg is a recursive descent parser, so it does
> lexical analysis and parsing at a time.

Oh, I forgot this property of PEGs - integrated lexing and parsing. So 
no need for a separate lexer!

> And the type of input which it
> can accept is string, wstring, dstring and ForwardRange whose element
> type is char, wchar or dchar.

Great. I think we should actually define the notion of BufferedRange. 
I'll get to that topic later.

> So, dividing lexical analysis and parsing into two processes is
> difficult in ctpg.

Yes, sorry I forgot about that!

> Importantly, a parser generated by ctpg is statically typed as one of
> the examples, "parsing simple arithmetic expression" shows.
> Generally a parser generated by other tool and accepting tokens returns
> the abstract syntax tree, but it return the evaluated value in the example.
> In other words, it does lexical analysis, parsing and (type) converting
> at a time.
> If you want simply abstract syntax tree, it may be a little pain to use
> ctpg.
>
> That's all I'd like to say about ctpg here.

Why would it be difficult to integrate AST creation with parsing? I'm 
not sure I understand this. My understanding is that you should be able 
to add a couple of simple operators ("^" and "%" as described above) to 
inform AST creation, and you're done!

Thanks,

Andrei