Parsing D Maybe Not Such a Good Idea <_<;

Tue Jun 14 22:27:52 PDT 2016

On Wed, Jun 15, 2016 at 04:40:07AM +0000, cy via Digitalmars-d wrote:
> On Wednesday, 15 June 2016 at 04:25:15 UTC, deadalnix wrote:
> > there is nothing in that example that is especially difficult for a
> > machine to muddle through.
> 
> Nested comments, have to keep track of nesting depth specifically for
> those.  Single line comments, which give newline two totally different
> semantics based on context, that the parser has to account for.
> Comments can go absolutely anywhere, so you can never advance by
> simply incrementing your character pointer, regardless of the
> statement you're parsing. You have to do a complex test for the
> presence of comments and pass over them, and if you want to do any
> backtracking, you have to be able to skip backwards over them.

IMHO, you're thinking about this at the wrong level of abstraction.

The first order of business, before you even think about parsing, should
be to tokenize (or lex) the input. This is the stage where you break up
the input into atomic chunks: keywords, parentheses, commas,
identifiers, etc.. Comments can simply be scanned once and discarded.
The result of this process should be a stream of tokens, consisting of
the logical units of the language, i.e., keywords, punctuation,
identifiers, numbers, literals, etc.. At this stage you could optionally
store say, the last n tokens for some fixed number n for backtracking
purposes, and they don't have to be anything more than just slices of
the input string(s).

Once you have the token stream, then we can talk about parsing.
Generally for a language of D's complexity you'd want to use a parser
generator like yacc or bison, or some kind of hand-crafted recursive
descent parser with backtracking. But you're basically looking at an
LALR(1) grammar (or, in D's case, maybe LALR(n)), meaning that you have
to keep track of multiple possible parses as you scan the token stream,
until you reach a synchronization point where it narrows down to one of
the possibilities. IIRC, function declarations require backtracking
because of the compile-time argument syntax, which makes certain
constructs ambiguous until you've finished scanning at least the first
argument list.  This makes it more complicated for a human to implement,
but nothing scary as far as machine-generated parsers go -- C++ is far
more insane to parse, for example. (Some have quipped that C++ is a
language that must be parsed before it can be lexed. :-P)

T

-- 
Кто везде - тот нигде.