Let's stop parser Hell

Philippe Sigaud philippe.sigaud at gmail.com
Tue Jul 31 22:44:33 PDT 2012


Thanks for taking the time to write such a detailed explanation!


> Yeah. Once I started on it, I made a lot of progress really quickly. There's
> still a fair bit to do (primarily having to do with literals), but it probably
> won't take all that much longer. Certainly, I'd expect to have it done within
> a couple of weeks if not sooner, unless something goes wrong.

Is it based on a prioritized list of regexes?

> Well, it's still a work in progress, so it certainly can be adjusted as
> necessary. I intend it to fully implement the spec (and make sure that both it
> and the spec match what dmd is doing) as far as lexing goes. The idea is that
> you should be able to build a fully compliant D parser on top of it and build
> a fully compliant D compiler on top of that.

My fear (having encoded the D grammar once) is that 99% of the code
constructs are easily done (even new stuff like the => syntax), but
that the remaining 1% is a nightmare. The many different kinds of
strings, with or without escaping, for example. Your lexer returns a
string literal as a string field, right? If so, escape characters like
" and ` make me nervous.
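
For reference, the literal forms I have in mind (all standard D
syntax, nothing invented):

auto a = "line\nbreak";      // double-quoted: escape sequences processed
auto b = `C:\dir\file`;      // WYSIWYG string: backslashes kept verbatim
auto c = r"C:\dir\file";     // also WYSIWYG, double-quoted form
auto d = "she said \"hi\"";  // an escaped " inside a double-quoted literal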

>
> It already supports all of the comment types and several of the string literal
> types. I haven't sorted out q{} yet, but I will before I'm done, and that may
> or may not affect how some things work, since I'm not quite sure how to handle
> q{} yet (it may end up being done with tokens marking the beginning and end of
> the token sequence encompassed by q{}, but we'll see).

I see. My question was because, as I said before, I found this to be a
difficult and hairy part of the D grammar, and an IDE might want to
colorize/manipulate the content of a string when it knows it's a
string mixin.
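
For instance (standard q{} token-string syntax; the contents must
themselves be valid D tokens):

auto code = q{ int x = 42; };  // token string: must lex as valid D tokens
mixin(code);                   // typical use: mixed in as code later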


> I don't have the code with me at the moment, but I believe that the token type
> looks something like
>
> struct Token
> {
>     TokenType type;
>     string str;
>     LiteralValue value;
>     SourcePos pos;
> }

Seems OK. I guess it can be templated on the string type, because
you'll surely want to lex dstrings.
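
Something like this, hypothetically (the template constraint and the
alias are my own sketch, not the actual code):

import std.traits : isSomeString;

struct Token(String) if (isSomeString!String)
{
    TokenType type;
    String str;         // slice of whatever string type was lexed
    LiteralValue value;
    SourcePos pos;
}

alias DToken = Token!dstring;  // instantiated for lexing dstrings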


> The type is an enum which gives the type of the token (obviously), including
> the various comment types and an error type (so errors are reported by
> getting a token that was an error token rather than throwing or anything
> like that, which should make lexing past malformed stuff easy).

And recovery just resumes at the next token. Good idea.
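
A sketch of how that could look from the caller's side (lexSource and
report are invented names):

enum TokenType { identifier, stringLiteral, /+ ... +/ error }

foreach (tok; lexSource(src))
{
    if (tok.type == TokenType.error)
        report(tok.pos, tok.str);  // note the error...
    // ...and keep iterating: lexing resumes at the next token
}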

> str holds the exact text which was lexed (this includes the entire comment
> for the various comment token types), which in the case of lexing a string
> rather than another range type would normally (always? - I don't remember)
> be a slice of the string being lexed, which should help make lexing strings
> very efficient.

Yes, you may as well use this excellent feature. But that means the
input must be held in memory at all times, no? If it's an input range
(coming from a big file), can you keep slices of it?
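
The catch with slices, as I understand it (illustrating the memory
consequence, not the lexer itself):

import std.file : readText;

string src = readText("big.d");  // the whole file sits in memory
string tok = src[100 .. 110];    // a token's str field, as a slice
// As long as `tok` is reachable, the GC keeps all of `src` alive.
// A true input range would force copying the characters out instead.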

> It may or may not make sense to change that to the range type used rather than
> string. For nesting block comments, the whole comment is one token (with the
> token type which is specifically for nested comments), regardless of whether
> there's any nesting going on. But that could be changed if there were a need
> to get separate tokens for the comments inside.

That's OK, as I replied in another message. It's now quite clear to me
how to deal with that.
IDEs/doc generators can parse the comments at their leisure, as they
are neither long nor critical sections of code.
They can even easily extract the code fragments used as examples
inside ---/--- markers.

> LiteralValue is a VariantN of the types that a literal can be (long, ulong,
> real, and string IIRC) and is empty unless the token is a literal type


You can make it an Algebraic!(long, ulong, real, string), as your
universe of types is restricted. That can even catch an illegal value.
Array literals are not considered literals, since they can contain
arbitrary expressions; is that right?
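
A minimal sketch with std.variant (the exact member set is my
assumption):

import std.variant : Algebraic;

alias LiteralValue = Algebraic!(long, ulong, real, string);

LiteralValue v = 42L;
assert(v.type == typeid(long));

LiteralValue none;       // default-initialized: token is not a literal
assert(!none.hasValue);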

And it's a good idea to drop the complex literals once and for all.

> (the
> various string postfixes - c, w, and d - are treated as different token
> types rather than giving the literal value different string types - the same
> with the integral and floating point literals).

That seems reasonable enough, but can you really store a dstring
literal in a string field?
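
If the value really is kept as a plain string, I suppose the
conversion happens at use time, something like:

import std.conv : to;

string raw = "héllo";           // literal value kept as UTF-8 text
dstring wide = raw.to!dstring;  // transcoded to UTF-32 only when needed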

> And pos holds the position in the text where the token started, which should
> make it easy to use for syntax highlighting

Does syntax highlighting need more than a token stream? Without having
thought a lot about it, it seems to me that IDEs tend to highlight
based just on the token type, not on a parse tree. So that means your
lexer can be used directly by interested people; that's nice.
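
That is, a highlighter could be as simple as (colorFor and emit are
invented names):

foreach (tok; lexSource(src))
    emit(tok.pos, tok.str, colorFor(tok.type));  // token type alone decides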

> and the like (as well as
> indicating where an error occurred). The initial position is passed as an
> optional argument to the lexing function, so it doesn't have to be 1:1 (though
> that's the default), and it allows you to select the tabwidth.

I really like the idea of having malformed tokens become error tokens,
with the lexing continuing. As I said before, that enables an IDE to
keep doing syntax highlighting.


> So, you'll pass a range and an optional starting position to the lexing
> function, and it'll return a lazy range of Tokens. Whitespace is stripped as
> part of the lexing process, but if you take the pos properties of two adjacent
> tokens, you should be able to determine how much whitespace was between them.

OK, so whitespace is strictly that, and comments are not dropped. That's OK.
I guess a D parser would then just filter the comments out of the token range.
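
E.g. with std.algorithm (isComment being a hypothetical predicate on
the token type):

import std.algorithm : filter;

auto codeOnly = lexSource(src)
                .filter!(t => !isComment(t.type));  // lazily drops comments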


> I _think_ that that's how it currently is, but again, I don't have the code
> with me at the moment, so it may not be 100% correct. And since it's a work in
> progress, it's certainly open to changes.

My only quibble for now would be the string/dstring thingy.

Did you handle UFCS with respect to floating point values? Is 1.f seen
as a float literal or as f called on 1 (an int)?
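
To make the ambiguity concrete:

// The ambiguous spelling in question:
//     auto x = 1.f;
// Two possible readings:
//   (a) the floating point literal 1.0f
//   (b) UFCS: the call f(1) on the integer literal 1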

