Lexer and parser generators using CTFE

Tue Feb 28 11:52:52 PST 2012

On 28.02.2012 22:46, Martin Nowak wrote:
> On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> wrote:
>
>> I'm starting a new thread on this because I think the matter is of
>> strategic importance.
>>
>> We all felt for a long time that there's a lot of potential in CTFE,
>> and potential applications have been discussed more than a few times,
>> ranging from formatting strings parsed to DSLs and parser generators.
>>
>> Such feats are now approaching fruition because a number of factors
>> converge:
>>
>> * Dmitry Olshansky's regex library (now in Phobos) generates efficient
>> D code straight from regexen.
>>
>> * The scope and quality of CTFE has improved enormously, making more
>> advanced uses possible and even relatively easy (thanks Don!)
>>
>> * Hisayuki Mima implemented a parser generator in only 3000 lines of
>> code (sadly, no comments or documentation yet :o))
>>
>> * With the occasion of that announcement we also find out Philippe
>> Sigaud has already a competing design and implementation of a parser
>> generator.
>>
>> This is the kind of stuff I've had an eye on for the longest time. I'm
>> saying it's of strategic importance because CTFE technology, though
>> not new and already available with some languages, has unique powers
>> when combined with other features of D. With CTFE we get to do things
>> that are quite literally impossible to do in other languages.
>>
>> We need to have a easy-to-use, complete, seamless, and efficient
>> lexer-parser generator combo in Phobos, pronto. The lexer itself could
>> use a character-level PEG or a classic automaton, and emit tokens for
>> consumption by a parser generator. The two should work in perfect
>> tandem (no need for glue code). At the end of the day, defining a
>> complete lexer+parser combo for a language should be just a few lines
>> longer than the textual representation of the grammar itself.
>>
>> What do you all think? Let's get this project off the ground!
>>
>>

To be truly successful it should have some distinct properties like:
- being faster or identical to handwritten stuff we already have (like 
e.g. std.json ), reminds us of recent INI parser proposal, should be an 
easy pick for general parser.
- be flexible, the result need not be a predefined AST tree nodes, 
Hisayuki Mima's parser shows some nice movement in this direction
- have reasonable compile times and memory consumption (though it will 
only improve over time)

Recalling EBNF parser idea that I run with before finally being dragged 
down by real life. Roughly I thought to follow hybrid LL(*) aproach, 
while I had a solid plan on doing DFA for lexer and parser lookahead, 
the other things were more or less floating.

>> Thanks,
>>
>> Andrei
>
> I wrote a generic lexer generator some time ago.
> It already let to some compiler O(N^2) optimizations, because the token
> declarations sneak into the mangling :(.
> I also finally added a workaround for a remaining CTFE bug (#6815).
>
> https://gist.github.com/1255439 - lexer generator

Cool stuff.

> https://gist.github.com/1262321 - complete and fast D lexer
>
> I've ditched an attempt to write a parser combinator. It was based on
> expression templates and ended up at spirit craziness.
>
> <PERSONAL OPINION
> The hassle of providing good error messages and synthesizing parse results
> in a generic parser outweigh the benefit of a declarative grammar.
> /PERSONAL OPINION>
>

Error reporting is the weak spot, still no good ideas on solving that.

> A lot becomes feasible from the CTFE perspective,
> despite some bugfixes I only miss exp and log currently.
>

Reminds me that I have to dig up a wonderful CTFE bug in std.regex 
that's being somehow worked around by duping arrays since August.

> I do not agree that it's the right moment to write a parser though.
> It hits the first of phobos two biggest shortcomings, the lack of a good
> I/O
> system and the missing Allocators.
> Any parser written now will either risk to not play nice with ranges
> or has to come up with it's own buffering again.

All good points. There is prototype of interactive regex matcher that 
works directly on stream (buried in std.regex), it even passed dry-run 
unittests back then. Though I had to postpone till I/O is sorted out. I 
really loved Steven's design with it's easy access to buffer and well 
thought out primitives, hope it will come about sometime soon.

-- 
Dmitry Olshansky