Lexer and parser generators using CTFE

Tue Feb 28 12:19:34 PST 2012

On Tue, Feb 28, 2012 at 11:52:52PM +0400, Dmitry Olshansky wrote:
> On 28.02.2012 22:46, Martin Nowak wrote:
> >On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu
> ><SeeWebsiteForEmail at erdani.org> wrote:
[...]
> >>We need to have a easy-to-use, complete, seamless, and efficient
> >>lexer-parser generator combo in Phobos, pronto. The lexer itself
> >>could use a character-level PEG or a classic automaton, and emit
> >>tokens for consumption by a parser generator. The two should work in
> >>perfect tandem (no need for glue code). At the end of the day,
> >>defining a complete lexer+parser combo for a language should be just
> >>a few lines longer than the textual representation of the grammar
> >>itself.
[...]
> To be truly successful it should have some distinct properties like:
> - being faster or identical to handwritten stuff we already have
> (like e.g. std.json ), reminds us of recent INI parser proposal,
> should be an easy pick for general parser.

We probably have to use string mixins to achieve this.

OTOH, a generic parser generator is unlikely to be able to handle
special case optimizations such as operator-precedence parsers
(http://en.wikipedia.org/wiki/Operator-precedence_parser), because
operator precedence is only implied by grammar rules, and to optimize
for this case requires explicitly specifying each operator's precedence.
There is some price to be paid for generality, unfortunately.

> - be flexible, the result need not be a predefined AST tree nodes,
> Hisayuki Mima's parser shows some nice movement in this direction

In my mind, D's delegates are ideal for this sort of thing. Each grammar
rule will have an associated callback. We can provide default callbacks
that generates AST nodes, but the user can override them with his own
delegates to do whatever they want (evaluate expressions/commands on the
fly, build only AST subtrees of interest, etc.).

> - have reasonable compile times and memory consumption (though it will
> only improve over time)

This will be good motivation to improve dmd's CTFE implementation. ;-)

[...]
> Error reporting is the weak spot, still no good ideas on solving that.

This is not specific to D, though, right?

[...]
> >I do not agree that it's the right moment to write a parser though.
> >It hits the first of phobos two biggest shortcomings, the lack of a
> >good I/O system and the missing Allocators.  Any parser written now
> >will either risk to not play nice with ranges or has to come up with
> >it's own buffering again.
> 
> All good points. There is prototype of interactive regex matcher
> that works directly on stream (buried in std.regex), it even passed
> dry-run unittests back then. Though I had to postpone till I/O is
> sorted out. I really loved Steven's design with it's easy access to
> buffer and well thought out primitives, hope it will come about
> sometime soon.
[...]

Are you talking about std.io? I glanced over that code briefly recently;
it looks very promising. Hope it will make it into Phobos soon. We
*need* to integrate streams, ranges, and the stdio writeln, etc., stuff.
The current situation to an outsider looks quite bad (streams not
compatible with stdio, no easy way to substitute string buffer for
files, etc.).

We also need automatic UTF encoding detection for all input
streams/files/ranges/whatever. Including mismatching endianness (i.e.
auto byte-swapping).  In a recent personal project to write a D lexer
(as an exercise to help me learn D better), I ended up having to roll my
own input stream abstraction in order to handle automatic encoding
detection. Which is quite poorly written, I've to admit. Something like
this needs to be standardized in Phobos with a well-written
implementation.

T

-- 
"You know, maybe we don't *need* enemies." "Yeah, best friends are about
all I can take." -- Calvin & Hobbes