Let's stop parser Hell

Jonathan M Davis jmdavisProg at gmx.com
Tue Jul 31 15:38:18 PDT 2012


On Tuesday, July 31, 2012 23:39:38 Philippe Sigaud wrote:
> On Tue, Jul 31, 2012 at 11:20 PM, Jonathan M Davis <jmdavisProg at gmx.com> wrote:
> > On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:
> >> Having std.lexer in Phobos would be quite good. With a pre-compiled lexer
> >> for D.
> > 
> > I'm actually quite far along with one now - one which is specifically
> > written and optimized for lexing D. I'll probably be done with it not too
> > long after the 2.060 release (though we'll see).
> 
> That was quick! Cool!

Yeah. Once I started on it, I made a lot of progress really quickly. There's 
still a fair bit to do (primarily having to do with literals), but it probably 
won't take all that much longer. Certainly, I'd expect to have it done within 
a couple of weeks if not sooner, unless something goes wrong.

> > Writing it has been going surprisingly
> > quickly actually, and I've already found some bugs in the spec as a result
> > (some of which have been fixed, some of which I still need to create pull
> > requests for). So, regardless of what happens with my lexer, at least the
> > spec will be more accurate.
> 
> Could you please describe the kind of token it produces?
> Can it build a symbol table?
> Does it recognize all kinds of strings (including q{ } ones)?
> How does it deal with comments, particularly nested ones?
> Does it automatically discard whitespace or produce it as tokens?
> I'd favor this approach, if only because wrapping the lexer in a
> filter!noWS(tokenRange) is easy.
> Does it produce a lazy range btw?

Well, it's still a work in progress, so it certainly can be adjusted as 
necessary. I intend it to fully implement the spec (and make sure that both it 
and the spec match what dmd is doing) as far as lexing goes. The idea is that 
you should be able to build a fully compliant D parser on top of it and build 
a fully compliant D compiler on top of that.

It already supports all of the comment types and several of the string literal 
types. I haven't sorted out q{} yet, but I will before I'm done. I'm not quite 
sure how to handle it, and how I do may affect how some things work (it may end 
up being done with tokens marking the beginning and end of the token sequence 
encompassed by q{}, but we'll see). At the moment, I'm in the middle of dealing 
with the named entity stuff, which unfortunately has revealed a rather nasty 
compiler bug with regard to template compile times, which I still need to 
report (I intend to do that this evening). The file generating the table of 
named entities currently takes over 6 minutes to compile on my Phenom II thanks 
to that bug, so I'm not sure how that's going to affect things. Regardless, by 
the time I'm done, the lexer should support _everything_ required for fully 
lexing D.

I don't have the code with me at the moment, but I believe that the token type 
looks something like

struct Token
{
    TokenType    type;  // kind of token, including comment and error types
    string       str;   // the exact text which was lexed
    LiteralValue value; // set only when the token is a literal
    SourcePos    pos;   // position in the source where the token starts
}

struct SourcePos
{
    size_t line;
    size_t col;
    size_t tabWidth = 8; // used to compute col when tabs are encountered
}

The type is an enum which gives the type of the token (obviously), including 
the various comment types and an error type. Errors are reported by getting an 
error token rather than by throwing or anything like that, which should make 
lexing past malformed input easy.
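
Something along these lines (the member names here are guesses off the top of 
my head, not necessarily what I actually have):

enum TokenType
{
    identifier,
    // ... a member per keyword, operator, and literal kind ...
    blockComment,  // /* */
    lineComment,   // //
    nestedComment, // /+ +/ - one token for the whole comment
    error          // malformed input; lexing continues past it
}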

str holds the exact text which was lexed (including the entire comment for the 
various comment token types). In the case of lexing a string rather than 
another range type, it would normally (always? - I don't remember) be a slice 
of the string being lexed, which should help make lexing strings very 
efficient. It may or may not make sense to change that to the range type being 
lexed rather than string. For nesting block comments, the whole comment is one 
token (with the token type which is specifically for nested comments), 
regardless of whether there's any nesting going on, but that could be changed 
if there were a need to get separate tokens for the comments inside.
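
As a rough illustration of the slicing when lexing a string (lexSource is just 
a placeholder name for the lexing function, and blockComment is from the sketch 
above):

// Hypothetical usage - not the real API:
string source = "/* a comment */ int x;";
auto tokens = lexSource(source);
auto tok = tokens.front;
assert(tok.type == TokenType.blockComment);
assert(tok.str is source[0 .. 15]); // same memory as the input - no copy made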

LiteralValue is a VariantN of the types that a literal can be (long, ulong, 
real, and string IIRC), and it's empty unless the token is a literal type. The 
various string postfixes - c, w, and d - are treated as different token types 
rather than giving the literal value different string types, and the same goes 
for the integral and floating point literals.
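
If it stays a VariantN, the declaration might boil down to something as simple 
as this (std.variant's Algebraic is just a VariantN restricted to the listed 
types):

import std.variant;

// Empty unless the token is a literal; holds the literal's value otherwise.
alias LiteralValue = Algebraic!(long, ulong, real, string);

// e.g. to check whether an integral literal is in there:
// if(auto p = token.value.peek!long())
//     ... use *p ...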

And pos holds the position in the text where the token started, which should 
make it easy to use for syntax highlighting and the like (as well as for 
indicating where an error occurred). The initial position is passed as an 
optional argument to the lexing function, so it doesn't have to be 1:1 (though 
that's the default), and it allows you to select the tab width.

So, you'll pass a range and an optional starting position to the lexing 
function, and it'll return a lazy range of Tokens. Whitespace is stripped as 
part of the lexing process, but if you take the pos properties of two adjacent 
tokens, you should be able to determine how much whitespace was between them.
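
So usage would look something like this (again, lexSource and the TokenType 
members are the placeholder names from above, not the real API):

import std.algorithm : filter;

auto tokens = lexSource(source, SourcePos(1, 1));

// It's a lazy range, so it composes with std.algorithm as usual -
// e.g. stripping out the comment tokens:
auto noComments = tokens.filter!(t => t.type != TokenType.blockComment &&
                                      t.type != TokenType.lineComment &&
                                      t.type != TokenType.nestedComment);

// And for two adjacent tokens a and b on the same line, the whitespace
// between them is roughly b.pos.col - (a.pos.col + a.str.length) columns
// wide (tabs being what tabWidth is there to account for).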

I _think_ that's how it currently stands, but again, I don't have the code with 
me at the moment, so it may not be 100% correct. And since it's a work in 
progress, it's certainly open to changes.

- Jonathan M Davis

