Let's stop parser Hell

Jonathan M Davis jmdavisProg at gmx.com
Tue Jul 31 23:11:21 PDT 2012


On Wednesday, August 01, 2012 07:44:33 Philippe Sigaud wrote:
> Is it based on a prioritized list of regexes?

I'm not using regexes at all. It uses string mixins to reduce code 
duplication, but it's effectively hand-written. If I do it right, it should be 
_very_ difficult to make it any faster than it's going to be. It even 
specifically avoids decoding Unicode characters and operates on ASCII 
characters as much as possible.
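
To give a rough idea of the kind of thing the mixins do (this is an 
illustrative sketch, not the actual code; Token, TokenType, and source are 
made-up names):

    import std.conv : to;

    // Illustrative sketch only - Token, TokenType, and source are made-up
    // names, not the real implementation. The mixin stamps out the repetitive
    // "match this exact ASCII sequence, emit this token type" code.
    template matchOp(string op, string tok)
    {
        enum matchOp =
            `if (source.length >= ` ~ op.length.to!string ~
            ` && source[0 .. ` ~ op.length.to!string ~ `] == "` ~ op ~ `")
            {
                source = source[` ~ op.length.to!string ~ ` .. $];
                return Token(TokenType.` ~ tok ~ `);
            }`;
    }

    // Used inside a hypothetical lexOperator function, longest match first:
    //     mixin(matchOp!("<<=", "shiftLeftAssign"));
    //     mixin(matchOp!("<<",  "shiftLeft"));
    //     mixin(matchOp!("<",   "less"));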

> > Well, it's still a work in progress, so it certainly can be adjusted as
> > necessary. I intend it to fully implement the spec (and make sure that
> > both it and the spec match what dmd is doing) as far as lexing goes. The
> > idea is that you should be able to build a fully compliant D parser on
> > top of it and build a fully compliant D compiler on top of that.
> 
> My fear (having encoded the D grammar once) is that 99% of the code
> constructs are easily done (even new stuff like the => syntax), but
> that the remaining 1% is a nightmare. The many different kinds of
> strings with or without escaping, for example. Your lexer returns a
> string literal as a string field, right? If so, escaping chars like "
> and ` makes me nervous.

Well, if you want the original text, there's the str property, which contains 
everything that was in the source code (including whatever escaping there 
was). The value property includes only the body, with any escaped characters 
already processed - it's the actual value of the literal. So, \` would be ` in 
the value property, named character entities would be their actual code 
points, etc.
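
For example (a minimal sketch; the local variables just stand in for the two 
token properties):

    void main()
    {
        // Hypothetical illustration of the two properties, for the literal
        //     "a\tb"
        // appearing in the source being lexed:
        string str   = `"a\tb"`;   // exact source text: quotes and the \t escape included
        string value = "a\tb";     // processed body: 'a', an actual tab character, 'b'
        assert(str.length == 7 && value.length == 3);
    }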

So, I don't really see a problem there. Sure, depending on what is done with 
the processed string, it could cause further problems, but that always happens 
when doing string processing.

> > It already supports all of the comment types and several of the string
> > literal types. I haven't sorted out q{} yet, but I will before I'm done,
> > and that may or may not affect how some things work, since I'm not quite
> > sure how to handle q{} yet (it may end up being done with tokens marking
> > the beginning and end of the token sequence encompassed by q{}, but we'll
> > see).
> 
> I see. My question was because, as I said before, I found this to be a
> difficult and hairy part of the D grammar, and an IDE might want to
> colorize/manipulate the content of a string, when it knows it's a
> string mixin.

Well, it could very well cause me trouble. I haven't gotten to it yet, and 
I'll need to study it to fully understand what it does before crafting my 
solution, but it'll probably end up being a sequence of tokens of some kind 
given that that's what it is - a string of tokens. Regardless, it'll be done 
in a way that whatever is using the lexer will be able to know what it is and 
process it accordingly.
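
For reference, the contents of a q{} token string have to be valid D tokens, 
which is why a token sequence is the natural fit:

    import std.stdio : writeln;

    // The contents of a q{} token string must lex as valid D tokens;
    // it's typically fed back into a string mixin later.
    enum code = q{
        int x = 42;
        writeln(x);
    };

    void main()
    {
        mixin(code);   // declares x and prints 42
    }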

> > str holds the exact text which was lexed (this includes the entire comment
> > for the various comment token types), which in the case of lexing a string
> > rather than another range type would normally (always? - I don't
> > remember) be a slice of the string being lexed, which should help make
> > lexing strings very efficient.
> Yeah, you may as well use this excellent feature. But that means the
> input must be held in memory always, no? If it's an input range
> (coming from a big file), can you keep slices from it?

Well, whatever is using the lexer is going to have to make sure that what it 
passes to the lexer continues to exist while it uses the lexer. Given how 
slicing works in D, it's pretty much a given that lexers and parsers are going 
to use slices heavily, and any code using a lexer or parser is going to have to 
take that into account. Slicing is one of the major reasons that Tango's XML 
parser is so insanely fast.
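
A minimal illustration of why slicing keeps this cheap, and why the source has 
to outlive the tokens that slice it:

    void main()
    {
        string source = "int answer = 42;";
        string tok = source[4 .. 10];        // "answer"
        assert(tok == "answer");
        assert(tok.ptr == source.ptr + 4);   // same memory as the source - no copy
    }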

> > LiteralValue is a VariantN of the types that a literal can be (long,
> > ulong,
> > real, and string IIRC) and is empty unless the token is a literal type
> 
> You can make it an Algebraic!(long, ulong, real, string), as your
> universe of types is restricted. That can even catch a non-legal value.

I'll have to look at that to see whether using Algebraic is better. I'm not 
super-familiar with std.variant, so I may have picked the wrong type. However, 
VariantN already holds a specific set of types (unlike Variant), so that 
part isn't a problem.
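
For what it's worth, a sketch of what that would look like with the literal 
types listed above:

    import std.variant : Algebraic;

    // Restrict the variant to exactly the types a literal value can have.
    alias LiteralValue = Algebraic!(long, ulong, real, string);

    void main()
    {
        LiteralValue v = 42L;
        assert(v.peek!long !is null);     // currently holds a long
        v = "hello";
        assert(v.get!string == "hello");
    }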

> Arrays literals are not considered literals, since they can contain
> arbitrary expressions, is that right?

There is no concept of array literal in the lexing portion of the spec. There 
may be at the level of the parser, but that's not the lexer's problem.

> And it's a good idea to drop the complex literals once and for all.

I'm not supporting any symbols which are known to be scheduled for deprecation 
(e.g. !<> and !>=). The _only_ stuff which I'm supporting along those lines is 
to-be-deprecated keywords (e.g. volatile and delete), since they still won't 
be legal to use as identifiers. And there's a function to query a token as to 
whether it's using a deprecated or unused keyword so that the program using 
the lexer can flag it if it wants to.

> > (the
> > various string postfixes - c,w, and d - are treated as different token
> > types rather than giving the literal value different string types - the
> > same with the integral and floating point literals).
> 
> That seems reasonable enough, but can you really store a dstring
> literal in a string field?

Yeah. Why not? The string is the same in the source code regardless of the 
type of the literal. The only difference is the letter tacked onto the end. 
That will be turned into the appropriate string type by the semantic analyzer, 
but the lexer doesn't care.
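
For example, the source characters of the literal are identical in all three 
cases; only the postfix differs:

    string  s = "foo";    // UTF-8  (no postfix, or the c postfix)
    wstring w = "foo"w;   // UTF-16
    dstring d = "foo"d;   // UTF-32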

> Does syntax highlighting need more than a token stream? Without having
> thought a lot about it, it seems to me IDEs tend to highlight based
> just on the token type, not on a parse tree. So that means your lexer
> can be used directly by interested people, that's nice.

If all you want to do is highlight specific symbols and keywords, then you 
don't need a parser. If you want to do fancier stuff related to templates and 
the like, then you'll need a parser and maybe even a semantic analyzer. But a 
lexer should be plenty for basic syntax highlighting.

> I really like the idea of having malformed tokens become error tokens,
> with the lexing continuing. As I said before, that enables an IDE
> to continue to do syntax highlighting.

That was the idea. You need to be able to continue lexing beyond errors in 
many cases, and that seemed the natural way to do it.
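
A hypothetical sketch of the idea (none of these names are the real API):

    // Hypothetical sketch - a malformed literal becomes an error token
    // carrying the offending source text, and lexing resumes afterwards
    // instead of aborting, so an IDE can keep highlighting the rest of
    // the file.
    enum TokenType { identifier, intLiteral, stringLiteral, error /* , ... */ }

    struct Token
    {
        TokenType type;
        string str;   // exact source text, even for error tokens
    }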

> Did you achieve UFCS, wrt floating point values? Is 1.f seen as a
> float or as f called on 1 (int)?

I haven't tackled floating point literals yet, but since the f in 1.f is not 
part of the floating point literal, I'll have to handle it. I suspect that 
that's one area where I'll need to update the spec though, since it probably 
hasn't been updated for that yet.

Basically, the lexer that I'm writing needs to be 100% compliant with the D 
spec and dmd (which means updating the spec or fixing dmd in some cases). It 
needs to be possible to build on top of it anything and everything that dmd 
does that would use a lexer (even if it's not the way that dmd currently does 
it), so that a fully compliant D parser and compiler can be built on top of it, 
as well as a ddoc generator and anything else that you'd need to do with a 
lexer for D. So, if you have any questions about what my lexer does (or is 
supposed to do) with regard to the spec, that should answer it. If my lexer 
doesn't match the spec or dmd when I'm done (aside from specific exceptions 
relating to stuff like deprecated symbols), then I screwed up.

- Jonathan M Davis

