Let's stop parser Hell

Philippe Sigaud philippe.sigaud at gmail.com
Tue Jul 31 14:10:37 PDT 2012


On Tue, Jul 31, 2012 at 7:29 PM, Tobias Pankrath <tobias at pankrath.net> wrote:
> On Tuesday, 31 July 2012 at 16:10:14 UTC, Kai Meyer wrote:
>
>> I know I'm a little late coming into this conversation. This seems like a
>> nice thread to toss myself into. I've started working on a generic lexer
>> that is based off of a defined grammar.
>
>
> Every helping hand is appreciated :-)

Hi Kai, welcome here!

>> As I read through the thread (I unfortunately don't have enough time to
>> read every post, but I skimmed through as many as I could, and read the ones
>> that seemed important), it seems like we need a parser in D that can lex D,
>> and provide a Range of tokens that could be consumed.
>
>
> Yes. To make this statement more precise: We need a lexer that provides
> a range of tokens and we need a parser which makes it possible to build
> an AST. Pegged combines both approaches but imposes an overhead if you
> just need a token list. However I'm not sure if this is a problem.

I think we need a tokenizer generator and a parser generator. They can
be grouped or not, but we need both, in Phobos.
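To make "a range of tokens" concrete, here is a minimal sketch. It's in Python for illustration only (a Phobos version would expose a D input range), and the Token/TokenKind names and the whitespace-splitting "lexer" are hypothetical toys:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterator

class TokenKind(Enum):
    IDENTIFIER = auto()
    NUMBER = auto()
    OPERATOR = auto()

@dataclass
class Token:
    kind: TokenKind
    text: str
    line: int

def tokens(source: str) -> Iterator[Token]:
    # Toy lexer: split on whitespace and classify crudely.
    # A real D lexer would scan character by character instead.
    for line_no, line in enumerate(source.splitlines(), 1):
        for word in line.split():
            if word.isidentifier():
                kind = TokenKind.IDENTIFIER
            elif word.isdigit():
                kind = TokenKind.NUMBER
            else:
                kind = TokenKind.OPERATOR
            yield Token(kind, word, line_no)
```

The point is the shape of the interface: the consumer pulls tokens lazily, one at a time, which is exactly what a D range gives you.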

We also need to define what's needed in a token. Kai's approach is OK,
but what is the _type field for an operator or an identifier?
Also, a lexer can fill a symbol table and produce identifier tokens
that refer to a particular entry in the symbol table.

I guess creating a tree of symbol tables according to scope visibility
is then more the job of the parser, but I'm not sure.
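The interning idea can be sketched like this (Python for illustration; the SymbolTable name and API are made up): the lexer hands out small integer ids, and the same identifier always maps back to the same entry:

```python
class SymbolTable:
    """Interns identifier names; tokens can then store an index
    into the table instead of carrying the string around."""

    def __init__(self):
        self._index = {}   # name -> id
        self._names = []   # id -> name

    def intern(self, name: str) -> int:
        # Return the existing id, or assign the next free one.
        if name not in self._index:
            self._index[name] = len(self._names)
            self._names.append(name)
        return self._index[name]

    def name(self, symbol_id: int) -> str:
        return self._names[symbol_id]
```

With this, comparing two identifiers is an integer comparison, and the parser (or a later semantic pass) can hang scope information off the table entries.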


> There are already some working D-Parsers buried in this thread.

Yes. We also need a D parser (not a generic one), but no one has pushed
one for Phobos inclusion yet.
Anyway, we can put generic parsers in Phobos too (LALR, LL(*), etc),
but I'd say the first priority would be to have a D lexer (producing a
range of tokens) and a D parser (consuming a range of tokens, and
executing semantic actions, like the building of an AST). Generic
parsers can come later on. Having a D parser in the standard
distribution would create much goodwill. Many people want that.
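As a toy illustration of that pipeline (Python, hypothetical names): a recursive-descent parser consuming a token stream and running a semantic action that builds a nested-tuple AST for additive expressions:

```python
def parse_expr(token_stream):
    """Parse 'primary (+ primary)*' from an iterable of token strings,
    building a left-nested tuple AST as the semantic action."""
    toks = list(token_stream)
    pos = 0

    def primary():
        nonlocal pos
        tok = toks[pos]
        pos += 1
        return int(tok)  # toy: primaries are integer literals

    node = primary()
    while pos < len(toks) and toks[pos] == "+":
        pos += 1
        # Semantic action: build an AST node instead of just recognizing.
        node = ("+", node, primary())
    return node
```

The same skeleton works whether the actions build an AST, fill a symbol table, or interpret on the fly; only the action bodies change.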


> I've only glimpsed at your code. For most languages lexing is far more
> expensive than parsing

Is that so?

> and thus the lexer has to be very fast and I wouldn't
> pursue your approach and instead use something like ragel. It already has D
> output but needs a separate build step.

Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D.

The only problem I see in Kai's approach (which is the standard one, a
prioritized list of regexes) is how to recognize tokens that are not
regular (I mean, that cannot be encoded as a regex), like nested
comments. How does the lexer know when to stop producing a 'comment'
token?
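For nesting comments specifically, a hand-written rule with a depth counter handles what a single regex cannot. A Python sketch (the function name is made up; D's /+ ... +/ block comments do nest like this):

```python
def skip_nested_comment(src: str, start: int) -> int:
    """Given that src[start:start+2] == '/+', return the index just past
    the matching '+/', honouring nesting via a depth counter."""
    assert src[start:start + 2] == "/+"
    depth = 0
    i = start
    while i < len(src):
        if src[i:i + 2] == "/+":
            depth += 1          # entering a (possibly nested) comment
            i += 2
        elif src[i:i + 2] == "+/":
            depth -= 1          # closing the innermost open comment
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise SyntaxError("unterminated nested comment")
```

A regex engine cannot count arbitrary nesting depth, so the generated lexer needs an escape hatch for rules like this: either a hand-coded scanning function per non-regular token, or a small recursive rule, which is what PEG-based tools get for free.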

Philippe


More information about the Digitalmars-d mailing list