Let's stop parser Hell
Jonathan M Davis
jmdavisProg at gmx.com
Tue Jul 31 23:39:41 PDT 2012
On Wednesday, August 01, 2012 08:20:04 Philippe Sigaud wrote:
> OK. It'll be more difficult to genericize, then, but that's not your
> problem (could be mine, though).
It was never intended to be even vaguely generic. It's targeting D
specifically. If someone can take it and make it generic when I'm done, then
great. But its goal is to lex D as efficiently as possible, and it'll do
whatever it takes to do that. From how the main switch statement's cases are
constructed though, there's a lot there which could be genericized. I
currently have several mixins used to create them, but I'm pretty sure that I
can generate a _lot_ of the case statements using a single mixin which just
takes the list of symbols and their associated tokens, which I'll probably do
before I'm done. So, I'm sure that pieces of what I'm doing could be used to
generate a lexer for another language, but plenty of it is very specific to D.
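
A minimal sketch of the single-mixin idea, not the actual lexer code: the
names Tok, SymbolEntry, symbolTable, genCases and lexSymbol are hypothetical,
and the table is limited to single-character symbols that need no escaping
inside a character literal. A compile-time function turns the symbol/token
list into the text of the case statements, and a string mixin pastes that
text into the switch.

```d
// Hypothetical token kinds, for illustration only.
enum Tok { plus, minus, star, slash, unknown }

// One row of the symbol table: the symbol's text and the name of its token.
struct SymbolEntry
{
    string symbol;    // assumed to be a single character
    string tokenName; // must name a member of Tok
}

enum SymbolEntry[] symbolTable = [
    SymbolEntry("+", "plus"),
    SymbolEntry("-", "minus"),
    SymbolEntry("*", "star"),
    SymbolEntry("/", "slash"),
];

// Runs at compile time (CTFE) and returns the source text of the cases,
// e.g. "case '+': return Tok.plus;\n".
string genCases(SymbolEntry[] table)
{
    string code;
    foreach (entry; table)
        code ~= "case '" ~ entry.symbol ~ "': return Tok." ~ entry.tokenName ~ ";\n";
    return code;
}

Tok lexSymbol(char c)
{
    switch (c)
    {
        // The generated case statements are compiled in place here.
        mixin(genCases(symbolTable));
        default:
            return Tok.unknown;
    }
}
```

Multi-character symbols such as ++ or >>>= would need nested switches or a
longest-match scheme, which this sketch leaves out.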
> >> That seems reasonable enough, but can you really store a dstring
> >> literal in a string field?
> >
> > Yeah. Why not? The string is the same in the source code regardless of the
> > type of the literal. The only difference is the letter tacked onto the
> > end. That will be turned into the appropriate string type by the semantic
> > analyzer, but the lexer doesn't care.
>
> I don't get it. Say I have a literal with non UTF-8 chars, how will
> it be stored inside the .str field as a string?
The literal is written in whatever encoding the range is in. If it's UTF-8,
it's UTF-8. If it's UTF-32, it's UTF-32. UTF-8 can hold exactly the same set
of characters that UTF-32 can. Your range could be UTF-32, but the string
literal is supposed to be UTF-8 ultimately. Or the range could be UTF-8 when
the literal is UTF-32. The characters themselves are in the encoding type of
the range regardless. It's just the values that the compiler generates which
change.
"hello world"
"hello world"c
"hello world"w
"hello world"d
are absolutely identical as far as lexing goes save for the trailing
character. It would be the same regardless of the characters in the strings or
the encoding used in the source file.
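
To illustrate, here is a rough sketch (hypothetical names, and it only
handles ordinary double-quoted literals) of splitting a literal's source text
into the quoted part and the optional trailing character. The quoted part
comes out the same for all four forms; only the suffix differs.

```d
// Hypothetical result type: the quoted text plus the suffix character.
struct StringLiteral
{
    string text;  // the literal including its quotes, without the suffix
    char suffix;  // 'c', 'w' or 'd', or '\0' when there is no suffix
}

// Assumes `literal` is a complete double-quoted literal as it appears
// in the source, e.g. "hello world"w.
StringLiteral splitSuffix(string literal)
{
    assert(literal.length >= 2 && literal[0] == '"');
    immutable last = literal[$ - 1];
    if (last == 'c' || last == 'w' || last == 'd')
        return StringLiteral(literal[0 .. $ - 1], last);
    return StringLiteral(literal, '\0');
}
```

What the suffix means for the type of the resulting value is left to the
semantic analyzer.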
In either case, a lot of string literals have to be decoded (e.g. if they
contain escaped characters), so you often can't create them with a slice
anyway, and if a range is used which isn't one of the string types, then it's
impossible for Token's value property to use the range type whenever it can't
use a slice. So, it's just simplest to always use string. It may be a slight
performance hit for lexing wstrings and dstrings, since they _could_ be both
sliced and created as new strings (unlike other ranges), but I don't think
that it's worth the extra complication to make it so that the string literal's
value could be other string types, especially when lexing strings is likely to
be the common case.
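
A small sketch of the slice-versus-decode distinction, assuming the input is
a string and the function receives only the text between the quotes. The
name literalValue is made up, and only a handful of escapes are handled; the
real set in the D spec is larger.

```d
import std.algorithm : canFind;

// Returns the value of a string literal body (the text between the quotes).
string literalValue(string source)
{
    // No escapes: the value is exactly the source text, so return the slice
    // itself and allocate nothing.
    if (!source.canFind('\\'))
        return source;

    // Escapes present: the value differs from the source text, so a new
    // string has to be built.
    string result;
    for (size_t i = 0; i < source.length; ++i)
    {
        if (source[i] == '\\' && i + 1 < source.length)
        {
            ++i;
            switch (source[i])
            {
                case 'n':  result ~= '\n'; break;
                case 't':  result ~= '\t'; break;
                case '"':  result ~= '"';  break;
                case '\\': result ~= '\\'; break;
                // Simplification: unknown escapes are copied through as-is.
                default:   result ~= source[i]; break;
            }
        }
        else
            result ~= source[i];
    }
    return result;
}
```

With a range that is not one of the string types, even the first branch
would have to copy, which is why always producing a string keeps things
simple.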
> > Basically, the lexer that I'm writing needs to be 100% compliant with the
> > D spec and dmd (which means updating the spec or fixing dmd in some cases),
> > and it needs to be possible to build on top of it anything and everything
> > that dmd does that would use a lexer (even if it's not the way that dmd
> > currently does it) so that it's possible to build a fully compliant D
> > parser and compiler on top of it as well as a ddoc generator and anything
> > else that you'd need to do with a lexer for D. So, if you have any
> > questions about what my lexer does (or is supposed to do) with regards to
> > the spec, that should answer it. If my lexer doesn't match the spec or
> > dmd when I'm done (aside from specific exceptions relating to stuff like
> > deprecated symbols), then I screwed up.
> That's a lofty goal, but that would indeed be quite good to have an
> officially integrated lexer in Phobos that would (as Andrei said) "be
> it". The token spec would be the lexer.
Well, I think that that's what a lexer in Phobos _has_ to do, or it can't be
in Phobos. And if Jacob Carlborg gets his way, dmd's frontend will eventually
switch to using the lexer and parser from Phobos, and in that sort of
situation, it's that much more imperative that they follow the spec exactly.
- Jonathan M Davis