Let's stop parser Hell

Jonathan M Davis jmdavisProg at gmx.com
Wed Aug 1 08:45:52 PDT 2012


On Wednesday, August 01, 2012 14:44:29 Philippe Sigaud wrote:
> Every time I think I understand D strings, you prove me wrong. So, I
> *still* don't get how that works:
> 
> say I have
> 
> auto s = " - some greek or chinese chars, mathematical symbols, whatever -
> "d;
> 
> Then, the "..." part is lexed as a string literal. How can the string
> field in the Token magically contain UTF32 characters?

It contains Unicode. The lexer is lexing whatever encoding the source is in, 
which has _nothing_ to do with the d on the end. It could be UTF-8, or UTF-16, 
or UTF-32. If we supported other encodings in ranges, it could be one of 
those. Which of those it is is irrelevant. As far as the value of the literal 
goes, these two strings are identical:

"ウェブサイト"
"\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"

The encoding of the source file is irrelevant. By tacking a d on the end

"ウェブサイト"d
"\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"d

you're just telling the compiler that you want the value that it generates to 
be in UTF-32. The source code could be in any of the supported encodings, and 
the string could be held in any encoding until the object code is actually 
generated.
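To make that concrete, here's a small sketch (mine, not from the original post) showing that the two spellings of the literal compare equal and that the d suffix only changes the literal's type, not its value:

```d
void main()
{
    // The literal's value is the same sequence of code points no matter
    // how the source file itself happens to be encoded.
    assert("ウェブサイト" == "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8");

    // The d suffix only selects the type of the generated value:
    // dstring (UTF-32) instead of the default string (UTF-8).
    auto s8  = "ウェブサイト";
    auto s32 = "ウェブサイト"d;
    static assert(is(typeof(s8)  == string));
    static assert(is(typeof(s32) == dstring));

    // Same six characters; the code unit count depends on the encoding.
    assert(s32.length == 6);  // UTF-32: one code unit per code point
    assert(s8.length  == 18); // UTF-8: three code units per katakana
}
```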

> So, wouldn't it make sense to at least provide an option on the lexer
> to specifically store identifier lexemes and comments as a dstring?

You mean make it so that Token is 

struct Token(R)
{
    TokenType    type;
    R       str;
    LiteralValue value;
    SourcePos    pos;
}

instead of

struct Token
{
    TokenType    type;
    string       str;
    LiteralValue value;
    SourcePos    pos;
}

or do you mean something else? I may do something like that, but I would point 
out that if R doesn't have slicing, then that doesn't work. So, str can't 
always be the same type as the original range. For ranges with no slicing, it 
would have to be something else (probably either string or 
typeof(takeExactly(range))). However, making str R _does_ come at the cost of 
complicating code using the lexer, since instead of just using Token, you have 
to worry about whether it's Token!string, Token!dstring, etc., and whether 
it's worth that complication is debatable. By far the most common use case is 
to lex string, and if str is string, and R is not, then you incur the penalty 
of converting R to string. So, the common use case is fast, and the uncommon 
use case still works but is slower, and the user of the lexer doesn't have to 
care what the original range type was.
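A hypothetical sketch of what the templated version might look like, folding in the no-slicing fallback described above (the stub types and the exact trait choices are my assumptions, not the actual proposed implementation):

```d
import std.range : hasSlicing, takeExactly;

// Stubs standing in for the lexer's real types (illustration only).
enum TokenType { identifier, stringLiteral }
struct LiteralValue {}
struct SourcePos { size_t line, col; }

struct Token(R)
{
    // When the range supports slicing, str can be a slice of the
    // original range; otherwise fall back to takeExactly's result type.
    static if (hasSlicing!R)
        alias Str = R;
    else
        alias Str = typeof(takeExactly(R.init, size_t.init));

    TokenType    type;
    Str          str;
    LiteralValue value;
    SourcePos    pos;
}
```

The cost mentioned above shows up at the use site: code consuming tokens must itself be templated on R (or handle each Token instantiation separately), whereas a plain string field keeps one concrete Token type for everyone.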

It could go either way. I used string on first pass, but as I said, I could 
change it to R later if that makes more sense. I'm not particularly hung up on 
that little detail at this point, and that's probably one of the things that 
can be changed reasonably easily later.

- Jonathan M Davis
