Lexer and parser generators using CTFE

Sun Mar 4 06:34:32 PST 2012

(2012/03/01 1:19), Andrei Alexandrescu wrote:
> On 2/28/12 8:58 AM, Hisayuki Mima wrote:
>> Certainly, I don't write yet the documentation of my library, ctpg.
>> (But a few examples available here:
>> https://github.com/youkei/ctpg/tree/master/example)
>
> Nice! I have a few comments. I should say that they entail a fair amount
> of work.
>
> The current setup makes things very difficult for a common case -
> generating a parse tree. The user would need to insert a lot of custom
> code to create the appropriate nodes, but that's very easy for the
> parser because it already has the components. The parser should have a
> "build AST" option, in which case it should build all nodes without much
> effort from the user. That's what ANTLR does, and it has a special
> operator "^" to indicate where the root of the tree should be
> (http://www.antlr2.org/doc/trees.html).
>
> So here's what I think your examples should look like:
>
> The syntax could be changed a bit - it should be less like D and more
> like PEG. The "$" is implicit and shouldn't be needed. The ";" is a
> useful terminator for large productions spanning more than one line, so
> let's keep it. I don't understand the use of "!", for example the PEG
> for expression parsing at
> http://en.wikipedia.org/wiki/Parsing_expression_grammar is:
>
> Value ← [0-9]+ / '(' Expr ')'
> Product ← Value (('*' / '/') Value)*
> Sum ← Product (('+' / '-') Product)*
> Expr ← Sum
>
> but your grammar has:
>
> int primary = !"(" addExp !")" / intLit_p;
>
> whereas it seems to me it should be
>
> int primary = "(" addExp ")" / intLit_p;
>
> No?
>
> Here's how I think a parser that also builds an AST looks like:
>
> mixin generateParsers!(
> Options.createAST,
> q{
> root <- addExp;
> addExp <- mulExp (("+" / "-")^ addExp)?
> mulExp <- primary (("*" / "/")^ mulExp)?
> primary <- "(" addExp ")"% / intLit_p;
> }
> );
>
> I used the following notation: a node suffixed with "^" becomes the root
> of the produced AST, and a node suffixed with "%" will be dropped
> altogether (that's because ")" is not needed in the AST after the
> expression has been parsed). With only three characters I informed the
> parser what the shape of the AST will be.
>
> At this point calling parse!root("1+2*3") would return an object of type
> ASTNode!"addExp", which in turn has two children, lhs and rhs. The lhs
> node has type ASTNode!"intLit_p" which has inside a value "1". The rhs
> node has type ASTNode!"mulExp", and that in turn has two children nodes
> lhs and rhs, both of type ASTNode!"intLit_p", each with its own value
> ("2" and "3", respectively). All these nodes are dynamically created by
> the parsing process and inherit a common type, e.g. ASTNode!"" or some
> type ASTRootNode.
>
>> So, I'd like to describe it here.
>>
>> To be honest, ctpg is inspired by one module of Scala's standard
>> library, Parser Combinators.
>> One of the differences between Parser Combinators and ctpg is that
>> Parser Combinators generates parsers in run-time, but ctpg generates
>> parsers in compile-time by the power of CTFE and mixin.
>>
>> A parser generated by ctpg is a recursive descent parser, so it does
>> lexical analysis and parsing at a time.
>
> Oh, I forgot this property of PEGs - integrated lexing and parsing. So
> no need for a separate lexer!
>
>> And the type of input which it
>> can accept is string, wstring, dstring and ForwardRange whose element
>> type is char, wchar or dchar.
>
> Great. I think we should actually define the notion of BufferedRange.
> I'll get to that topic later.
>
>> So, dividing lexical analysis and parsing into two processes is
>> difficult in ctpg.
>
> Yes, sorry I forgot about that!
>
>> Importantly, a parser generated by ctpg is statically typed as one of
>> the examples, "parsing simple arithmetic expression" shows.
>> Generally a parser generated by other tool and accepting tokens returns
>> the abstract syntax tree, but it return the evaluated value in the
>> example.
>> In other words, it does lexical analysis, parsing and (type) converting
>> at a time.
>> If you want simply abstract syntax tree, it may be a little pain to use
>> ctpg.
>>
>> That's all I'd like to say about ctpg here.
>
> Why would it be difficult to integrate AST creation with parsing? I'm
> not sure I understand this. My understanding is that you should be able
> to add a couple of simple operators ("^" and "%" as described above) to
> inform AST creation, and you're done!
>
>
> Thanks,
>
> Andrei
>

Certainly, "built AST" option is very interesting.
I don't know whether building AST is a common case or not because I'm 
unfamiliar with syntax analysis.
But I'd like to complete ctpg as D version of boost::spirit::qi first.
Currently, ctpg can be used probably like boost::spirit::qi. (Both do 
typed syntax analysis.)
There are some points where ctpg is superior to boost::spirit::qi. For 
example, The source code written by using ctpg is simple because it is 
like PEG.
In addition, I'm planning to improve error messages by making excellent 
use of #line and __LINE__.
I'd like to write the library which is functionally same 
boost::spirit::qi and written in D style.
Hence, I'd like to make implementing "built AST" option i.e. untyped 
syntax analysis a low priority.
What is your idea?

P.S. prefix operator "!" is same as postfix operator "%" which you said.
The types which parsers on both sides of "/" operator have should be 
compatible.
The type of "(" addExp ")" is Tuple!(string, int, string) and the type 
of intLit_p is int. They are not compatible.
The type of !"(" addExp !")" is int. So the types of !"(" addExp !")" 
and intLit_p are compatible.

Thanks,
Hisayuki Mima