Let's stop parser Hell
Jonathan M Davis
jmdavisProg at gmx.com
Tue Jul 31 15:38:18 PDT 2012
On Tuesday, July 31, 2012 23:39:38 Philippe Sigaud wrote:
> On Tue, Jul 31, 2012 at 11:20 PM, Jonathan M Davis <jmdavisProg at gmx.com>
wrote:
> > On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:
> >> Having std.lexer in Phobos would be quite good. With a pre-compiled lexer
> >> for D.
> >
> > I'm actually quite far along with one now - one which is specifically
> > written and optimized for lexing D. I'll probably be done with it not too
> > long after the 2.060 release (though we'll see).
>
> That was quick! Cool!
Yeah. Once I started on it, I made a lot of progress really quickly. There's
still a fair bit to do (primarily having to do with literals), but it probably
won't take all that much longer. Certainly, I'd expect to have it done within
a couple of weeks if not sooner, unless something goes wrong.
> > Writing it has been going surprisingly quickly actually, and I've already
> > found some bugs in the spec as a result
> > (some of which have been fixed, some of which I still need to create pull
> > requests for). So, regardless of what happens with my lexer, at least the
> > spec will be more accurate.
>
> Could you please describe the kind of token it produces?
> Can it build a symbol table?
> Does it recognize all kind of strings (including q{ } ones)?
> How does it deal with comments, particularly nested ones?
> Does it automatically discard whitespaces or produce them as a token?
> I'd favor this approach, if only because wrapping the lexer in a
> filter!noWS(tokenRange) is easy.
> Does it produce a lazy range btw?
Well, it's still a work in progress, so it certainly can be adjusted as
necessary. I intend it to fully implement the spec (and make sure that both it
and the spec match what dmd is doing) as far as lexing goes. The idea is that
you should be able to build a fully compliant D parser on top of it and build
a fully compliant D compiler on top of that.
It already supports all of the comment types and several of the string literal
types. I haven't sorted out q{} yet, but I will before I'm done, and that may
or may not affect how some things work, since I'm not quite sure how to handle
q{} yet (it may end up being done with tokens marking the beginning and end of
the token sequence encompassed by q{}, but we'll see). I'm in the middle of
dealing with the named entity stuff at the moment, which unfortunately has
revealed a rather nasty compiler bug with regards to template compile times,
which I still need to report (I intend to do that this evening). The file
generating the table of named entities currently takes over 6 minutes to
compile on my Phenom II thanks to that bug, so I'm not quite sure how that's
going to affect things. Regardless, the lexer should support _everything_ as
far as what's required for fully lexing D goes by the time that I'm done.
I don't have the code with me at the moment, but I believe that the token type
looks something like
struct Token
{
    TokenType type;
    string str;
    LiteralValue value;
    SourcePos pos;
}

struct SourcePos
{
    size_t line;
    size_t col;
    size_t tabWidth = 8;
}
The type is an enum which gives the type of the token (obviously), which
includes the various comment types and an error type (so errors are reported
by getting a token that was an error token rather than throwing or anything
like that, which should make lexing past malformed stuff easy).
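The post doesn't show the consuming side, but the error-token approach might look something like this minimal, self-contained sketch. The `TokenType` members and the hand-built token array are stand-ins, not the actual API:

```d
import std.stdio;

// Simplified stubs for illustration; the real enum would have one member
// per D token kind.
enum TokenType { identifier, intLiteral, error }

struct SourcePos { size_t line; size_t col; size_t tabWidth = 8; }

struct Token
{
    TokenType type;
    string str;
    SourcePos pos;
}

void main()
{
    // Hand-built tokens standing in for the lexer's output.
    auto tokens = [Token(TokenType.identifier, "foo", SourcePos(1, 1)),
                   Token(TokenType.error, "@", SourcePos(1, 5)),
                   Token(TokenType.intLiteral, "42", SourcePos(1, 7))];

    foreach(t; tokens)
    {
        if(t.type == TokenType.error)
        {
            // No exception is thrown; report it and keep going past the bad spot.
            writefln("lex error at %s:%s: %s", t.pos.line, t.pos.col, t.str);
            continue;
        }
        // normal token processing goes here...
    }
}
```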
str holds the exact text which was lexed (this includes the entire comment for
the various comment token types), which in the case of lexing a string rather
than another range type would normally (always? - I don't remember) be a slice
of the string being lexed, which should help make lexing strings very
efficient. It may or may not make sense to change that to the range type used
rather than string. For nested block comments, the whole comment is one token
(with the token type which is specifically for nested comments), regardless of
whether there's any nesting going on. But that could be changed if there were
a need to get separate tokens for the comments inside.
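Since D string slices share memory with the original, the zero-copy point can be shown in isolation (nothing here is the lexer's actual code):

```d
void main()
{
    string source = "/* a comment */ int x;";

    // A token's str would be a slice like this: same characters, same memory,
    // no allocation or copying involved.
    string commentStr = source[0 .. 15];

    assert(commentStr == "/* a comment */");
    assert(commentStr.ptr is source.ptr); // points into the original buffer
}
```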
LiteralValue is a VariantN of the types that a literal can be (long, ulong,
real, and string IIRC) and is empty unless the token is a literal type (the
various string postfixes - c, w, and d - are treated as different token types
rather than giving the literal value different string types - the same with
the integral and floating point literals).
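std.variant's Algebraic is a convenience wrapper around VariantN, so the declaration presumably looks something like the following. The alias itself is a guess; only the member types come from the post:

```d
import std.variant;

// Assumption: the actual declaration may differ, but a VariantN restricted
// to the mentioned literal types would look like this.
alias Algebraic!(long, ulong, real, string) LiteralValue;

void main()
{
    LiteralValue v; // empty unless the token is a literal
    assert(!v.hasValue);

    v = 42L;
    assert(v.get!long == 42);

    v = "hello";
    assert(v.get!string == "hello");
}
```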
And pos holds the position in the text where the token started, which should
make it easy to use for syntax highlighting and the like (as well as
indicating where an error occurred). The initial position is passed as an
optional argument to the lexing function, so it doesn't have to be 1:1 (though
that's the default), and it allows you to select the tab width.
So, you'll pass a range and an optional starting position to the lexing
function, and it'll return a lazy range of Tokens. Whitespace is stripped as
part of the lexing process, but if you take the pos properties of two adjacent
tokens, you should be able to determine how much whitespace was between them.
I _think_ that that's how it currently is, but again, I don't have the code
with me at the moment, so it may not be 100% correct. And since it's a work in
progress, it's certainly open to changes.
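To make the whitespace-reconstruction idea concrete, here's a self-contained sketch with stub types again (only the arithmetic matters; the field names follow the Token/SourcePos layout above):

```d
struct SourcePos { size_t line; size_t col; size_t tabWidth = 8; }
struct Token { string str; SourcePos pos; }

void main()
{
    // Suppose "int  x" was lexed: "int" starts at column 1, "x" at column 6.
    auto a = Token("int", SourcePos(1, 1));
    auto b = Token("x", SourcePos(1, 6));

    // On the same line, the stripped whitespace between two adjacent tokens
    // is the distance from where one ends to where the next begins.
    assert(a.pos.line == b.pos.line);
    auto gap = b.pos.col - (a.pos.col + a.str.length);
    assert(gap == 2); // two spaces between "int" and "x"
}
```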
- Jonathan M Davis
More information about the Digitalmars-d mailing list