Lexer and parser generators using CTFE

Wed Feb 29 01:48:30 PST 2012

On 29.02.2012 0:19, H. S. Teoh wrote:
> On Tue, Feb 28, 2012 at 11:52:52PM +0400, Dmitry Olshansky wrote:
>> On 28.02.2012 22:46, Martin Nowak wrote:
>>> On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu
>>> <SeeWebsiteForEmail at erdani.org>  wrote:
> [...]
>>>> We need to have a easy-to-use, complete, seamless, and efficient
>>>> lexer-parser generator combo in Phobos, pronto. The lexer itself
>>>> could use a character-level PEG or a classic automaton, and emit
>>>> tokens for consumption by a parser generator. The two should work in
>>>> perfect tandem (no need for glue code). At the end of the day,
>>>> defining a complete lexer+parser combo for a language should be just
>>>> a few lines longer than the textual representation of the grammar
>>>> itself.
> [...]
>> To be truly successful it should have some distinct properties like:
>> - being faster or identical to handwritten stuff we already have
>> (like e.g. std.json ), reminds us of recent INI parser proposal,
>> should be an easy pick for general parser.
>
> We probably have to use string mixins to achieve this.
>
> OTOH, a generic parser generator is unlikely to be able to handle
> special case optimizations such as operator-precedence parsers
> (http://en.wikipedia.org/wiki/Operator-precedence_parser), because
> operator precedence is only implied by grammar rules, and to optimize
> for this case requires explicitly specifying each operator's precedence.
> There is some price to be paid for generality, unfortunately.
>

Well I see no problem in defining a separate OperatorGrammar 
constructor, that will take base Unit production and a bunch of 
operators with arity, plus brackets with their priority is plenty. 
Semantic then actions are trivially defined for each of operators 
respectively.

Someway of stacking grammars is bonus points for the project. As Andrei 
started with lexer+parser, I see no problem with it being 
lexer+parser(+parser)*.

>
>> - be flexible, the result need not be a predefined AST tree nodes,
>> Hisayuki Mima's parser shows some nice movement in this direction
>
> In my mind, D's delegates are ideal for this sort of thing. Each grammar
> rule will have an associated callback. We can provide default callbacks
> that generates AST nodes, but the user can override them with his own
> delegates to do whatever they want (evaluate expressions/commands on the
> fly, build only AST subtrees of interest, etc.).
>

Yup delegates and mixins are the tools. More over lambdas shine here, 
less clutter + type deduction.

>
>> - have reasonable compile times and memory consumption (though it will
>> only improve over time)
>
> This will be good motivation to improve dmd's CTFE implementation. ;-)
>
>
> [...]
>> Error reporting is the weak spot, still no good ideas on solving that.
>
> This is not specific to D, though, right?

Such a framework relying on mixins is bound to produce awful error 
messages at compile-time, unless explicitly stuffed up to watterline 
with some kind of static assert(__traits(compiles,...),"Error x + info");

>
>
> [...]
>>> I do not agree that it's the right moment to write a parser though.
>>> It hits the first of phobos two biggest shortcomings, the lack of a
>>> good I/O system and the missing Allocators.  Any parser written now
>>> will either risk to not play nice with ranges or has to come up with
>>> it's own buffering again.
>>
>> All good points. There is prototype of interactive regex matcher
>> that works directly on stream (buried in std.regex), it even passed
>> dry-run unittests back then. Though I had to postpone till I/O is
>> sorted out. I really loved Steven's design with it's easy access to
>> buffer and well thought out primitives, hope it will come about
>> sometime soon.
> [...]
>
> Are you talking about std.io? I glanced over that code briefly recently;
> it looks very promising. Hope it will make it into Phobos soon. We
> *need* to integrate streams, ranges, and the stdio writeln, etc., stuff.
> The current situation to an outsider looks quite bad (streams not
> compatible with stdio, no easy way to substitute string buffer for
> files, etc.).

Aye, that's it. Not to mention DIP9, though that's another topic in itself.

>
> We also need automatic UTF encoding detection for all input
> streams/files/ranges/whatever. Including mismatching endianness (i.e.
> auto byte-swapping).  In a recent personal project to write a D lexer
> (as an exercise to help me learn D better), I ended up having to roll my
> own input stream abstraction in order to handle automatic encoding
> detection. Which is quite poorly written, I've to admit. Something like
> this needs to be standardized in Phobos with a well-written
> implementation.
>
>
> T
>

-- 
Dmitry Olshansky