Let's stop parser Hell

Tue Jul 31 23:39:41 PDT 2012

On Wednesday, August 01, 2012 08:20:04 Philippe Sigaud wrote:
> OK. It'll be more difficult to genericize, then, but that's not your
> problem (could be mine, though).

It was never intended to be even vaguely generic. It's targeting D 
specifically. If someone can take it and make it generic when I'm done, then 
great. But its goal is to lex D as efficiently as possible, and it'll do 
whatever it takes to do that. From how the main switch statement's cases are 
constructed though, there's a lot there which could be genericized. I 
currently have several mixins used to create them, but I'm pretty sure that I 
can generate a _lot_ of the case statements using a single mixin which just 
takes the list of symbols and their associated tokens, which I'll probably do 
before I'm done. So, I'm sure that pieces of what I'm doing could be used to 
generate a lexer for another language, but plenty of it is very specific to D.
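
As a rough sketch of the single-mixin idea (not the actual lexer code: 
TokenType, SymbolEntry, symbols, generateCases and lexSymbol are all names 
made up for illustration, and only single-character symbols are handled), a 
table of symbols and token names can be turned into case statements at 
compile time with a string mixin:

```d
// Hypothetical token kinds; the real lexer's token type is not shown here.
enum TokenType { plus, minus, star, slash, error }

// Hypothetical table entry: a one-character symbol and the name of its token.
struct SymbolEntry { string symbol; string tokenName; }

enum SymbolEntry[] symbols = [
    SymbolEntry("+", "plus"),
    SymbolEntry("-", "minus"),
    SymbolEntry("*", "star"),
    SymbolEntry("/", "slash"),
];

// Builds the source text of one `case` per table entry. It is an ordinary
// function, but it is evaluated at compile time when used inside mixin().
string generateCases(SymbolEntry[] table)
{
    string code;
    foreach (entry; table)
        code ~= "case '" ~ entry.symbol ~ "': return TokenType."
              ~ entry.tokenName ~ ";\n";
    return code;
}

TokenType lexSymbol(char c)
{
    switch (c)
    {
        // Expands to: case '+': return TokenType.plus; ... and so on.
        mixin(generateCases(symbols));
        default: return TokenType.error;
    }
}

void main()
{
    assert(lexSymbol('+') == TokenType.plus);
    assert(lexSymbol('/') == TokenType.slash);
    assert(lexSymbol('?') == TokenType.error);
}
```

Adding a symbol then means adding one line to the table rather than writing 
another case by hand. Multi-character symbols such as += would need nested 
switches or lookahead, which this sketch leaves out.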

> >> That seems reasonable enough, but can you really store a dstring
> >> literal in a string field?
> > 
> > Yeah. Why not? The string is the same in the source code regardless of the
> > type of the literal. The only difference is the letter tacked onto the
> > end.
> > That will be turned into the appropriate string type by the semantic
> > analyzer, but the lexer doesn't care.
> 
> I don't get it. Say I have a literal with non UTF-8 chars, how will
> it be stored inside the .str field as a string?

The literal is written in whatever encoding the range is in. If it's UTF-8, 
it's UTF-8. If it's UTF-32, it's UTF-32. UTF-8 can hold exactly the same set 
of characters that UTF-32 can. Your range could be UTF-32, but the string 
literal is supposed to be UTF-8 ultimately. Or the range could be UTF-8 when 
the literal is UTF-32. The characters themselves are in the encoding type of 
the range regardless. It's just the values that the compiler generates which 
change.

"hello world"
"hello world"c
"hello world"w
"hello world"d

are absolutely identical as far as lexing goes save for the trailing 
character. It would be the same regardless of the characters in the strings or 
the encoding used in the source file.
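
To illustrate (a minimal sketch, assuming a literal with no escape sequences; 
splitLiteral is a hypothetical helper and not part of the lexer being 
discussed), the text of the literal can be split from its suffix without 
caring which string type the suffix asks for, and converting between UTF-8 
and UTF-32 loses nothing:

```d
import std.conv : to;
import std.typecons : Tuple;

// Hypothetical helper: splits a simple double-quoted literal into its text
// and its optional suffix ('c', 'w' or 'd'; '\0' when there is none).
Tuple!(string, "text", char, "suffix") splitLiteral(string source)
{
    assert(source.length >= 2 && source[0] == '"');

    // Scanning code units is safe here: '"' is ASCII, so it can never appear
    // inside a multi-byte UTF-8 sequence.
    size_t end = 1;
    while (end < source.length && source[end] != '"')
        ++end;
    assert(end < source.length, "unterminated literal");

    string text = source[1 .. end];
    char suffix = '\0';
    if (end + 1 < source.length)
    {
        immutable c = source[end + 1];
        if (c == 'c' || c == 'w' || c == 'd')
            suffix = c;
    }
    return typeof(return)(text, suffix);
}

void main()
{
    // All four spellings yield the same text; only the suffix differs.
    foreach (src; [`"hello world"`, `"hello world"c`,
                   `"hello world"w`, `"hello world"d`])
        assert(splitLiteral(src).text == "hello world");

    assert(splitLiteral(`"hello world"`).suffix == '\0');
    assert(splitLiteral(`"hello world"w`).suffix == 'w');

    // UTF-8 and UTF-32 hold the same set of characters: a round trip is lossless.
    string utf8 = "héllo wörld";
    dstring utf32 = utf8.to!dstring;
    assert(utf32.to!string == utf8);
}
```

The suffix is simply recorded next to the text; choosing the final string 
type from it would be left to a later stage.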

In either case, a lot of string literals have to be decoded (e.g. if they 
contain escaped characters), so you often can't create them with a slice 
anyway, and if a range is used which isn't one of the string types, then it's 
impossible for Token's value property to use the range type whenever it can't 
use a slice. So, it's just simplest to always use string. It may be a slight 
performance hit for lexing wstrings and dstrings, since they _could_ be both 
sliced and created as new strings (unlike other ranges), but I don't think 
that it's worth the extra complication to make it so that the string literal's 
value could be other string types, especially when lexing UTF-8 input (i.e. 
string) is likely to be the common case.
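
A small sketch of why escapes rule out slicing (literalValue is a 
hypothetical helper which only understands \n, \t, \\ and \" and is not the 
lexer's real decoding routine): when the literal's text contains no 
backslash, the value can be the very same slice of the source, but as soon as 
an escape appears, a new string has to be built.

```d
import std.string : indexOf;

// Hypothetical helper: takes the text between the quotes and returns the
// literal's value.
string literalValue(string raw)
{
    // No escapes: the value is just a slice of the source, no allocation.
    if (raw.indexOf('\\') == -1)
        return raw;

    // Escapes present: the value differs from the source text, so it has to
    // be assembled as a new string.
    string result;
    for (size_t i = 0; i < raw.length; ++i)
    {
        if (raw[i] == '\\' && i + 1 < raw.length)
        {
            ++i;
            switch (raw[i])
            {
                case 'n':  result ~= '\n'; break;
                case 't':  result ~= '\t'; break;
                case '\\': result ~= '\\'; break;
                case '"':  result ~= '"';  break;
                // Sketch only: a real lexer would report an error here.
                default:   result ~= raw[i]; break;
            }
        }
        else
            result ~= raw[i];
    }
    return result;
}

void main()
{
    // Without escapes, the result is the same slice (same pointer and length).
    string plain = "hello world";
    assert(literalValue(plain) is plain);

    // With an escape, the two source characters \ and n become one newline.
    string escaped = `line one\nline two`;
    assert(literalValue(escaped) == "line one\nline two");
    assert(literalValue(escaped).length == escaped.length - 1);
}
```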

> > Basically, the lexer that I'm writing needs to be 100% compliant with the
> > D spec and dmd (which means updating the spec or fixing dmd in some cases),
> > and it needs to be possible to build on top of it anything and everything
> > that dmd does that would use a lexer (even if it's not the way that dmd
> > currently does it) so that it's possible to build a fully compliant D
> > parser and compiler on top of it as well as a ddoc generator and anything
> > else that you'd need to do with a lexer for D. So, if you have any
> > questions about what my lexer does (or is supposed to do) with regards to
> > the spec, that should answer it. If my lexer doesn't match the spec or
> > dmd when I'm done (aside from specific exceptions relating to stuff like
> > deprecated symbols), then I screwed up.
> That's a lofty goal, but that would indeed be quite good to have an
> officially integrated lexer in Phobos that would (as Andrei said) "be
> it". The token spec would be the lexer.

Well, I think that that's what a lexer in Phobos _has_ to do, or it can't be 
in Phobos. And if Jacob Carlborg gets his way, dmd's frontend will eventually 
switch to using the lexer and parser from Phobos, and in that sort of 
situation, it's that much more imperative that they follow the spec exactly.

- Jonathan M Davis