Let's stop parser Hell
Jonathan M Davis
jmdavisProg at gmx.com
Wed Aug 1 08:45:52 PDT 2012
On Wednesday, August 01, 2012 14:44:29 Philippe Sigaud wrote:
> Everytime I think I understand D strings, you prove me wrong. So, I
> *still* don't get how that works:
>
> say I have
>
> auto s = " - some greek or chinese chars, mathematical symbols, whatever -
> "d;
>
> Then, the "..." part is lexed as a string literal. How can the string
> field in the Token magically contain UTF32 characters?
It contains Unicode. The lexer lexes whatever encoding the source is in,
which has _nothing_ to do with the d on the end. It could be UTF-8, UTF-16,
or UTF-32 (and if we supported other encodings in ranges, it could be one of
those too). Which of them it is doesn't matter. As far as the value of the
literal goes, these two strings are identical:
"ウェブサイト"
"\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"
The encoding of the source file is irrelevant. By tacking a d on the end
"ウェブサイト"d
"\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"d
you're just telling the compiler that you want the value that it generates to
be in UTF-32. The source code could be in any of the supported encodings, and
the string could be held in any encoding until the object code is actually
generated.
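This is easy to check directly in D: the suffix only changes the type the
compiler produces, not the value of the literal. A minimal sketch:

```d
// The source file's encoding is independent of the literal's target type.
// The suffix (none, w, d) picks string (UTF-8), wstring (UTF-16), or
// dstring (UTF-32) respectively; the value is the same either way.
import std.conv : to;

void main()
{
    auto s8  = "ウェブサイト";    // string  (UTF-8 code units)
    auto s16 = "ウェブサイト"w;   // wstring (UTF-16 code units)
    auto s32 = "ウェブサイト"d;   // dstring (UTF-32 code units)

    static assert(is(typeof(s8)  == string));
    static assert(is(typeof(s16) == wstring));
    static assert(is(typeof(s32) == dstring));

    // Converting between encodings yields equal values, and the escaped
    // form denotes exactly the same dstring.
    assert(to!dstring(s8) == s32);
    assert("\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"d == s32);
}
```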
> So, wouldn't it make sense to at least provide an option on the lexer
> to specifically store identifier lexemes and comments as a dstring?
You mean make it so that Token is
struct Token(R)
{
    TokenType type;
    R str;
    LiteralValue value;
    SourcePos pos;
}
instead of
struct Token
{
    TokenType type;
    string str;
    LiteralValue value;
    SourcePos pos;
}
or do you mean something else? I may do something like that, but I would point
out that if R doesn't have slicing, then that doesn't work. So, str can't
always be the same type as the original range. For ranges with no slicing, it
would have to be something else (probably either string or
typeof(takeExactly(range))). However, making str R _does_ come at the cost of
complicating code that uses the lexer, since instead of just using Token, you
have to worry about whether it's a Token!string, Token!dstring, etc., and
whether it's worth that complication is debatable. By far the most common use
case is to lex a string, and if str is string and R is not, then you incur the
penalty of converting R to string. So, the common use case is fast, the
uncommon use case still works but is slower, and the user of the lexer doesn't
have to care what the original range type was.
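The tradeoff above can be sketched like this (the names here are a
hypothetical illustration, not the lexer's actual API): with a plain string
field, lexing any other range type pays for a conversion, while a templated
token can keep the range's own type whenever it supports slicing.

```d
import std.conv : to;
import std.range : takeExactly;
import std.range.primitives : hasSlicing;

// Hypothetical sketch of the two designs being discussed.

// Non-templated token: str is always string, so lexing any other range
// type pays for a conversion, but user code only ever sees Token.
struct Token
{
    string str;
}

// Templated token: str keeps the range's own type when it can be sliced;
// otherwise it falls back to takeExactly's result type.
struct TokenR(R)
{
    static if (hasSlicing!R)
        R str;
    else
        typeof(takeExactly(R.init, 0)) str;
}

void main()
{
    dstring src = "foo bar"d;

    // Plain Token: the lexeme is transcoded UTF-32 -> UTF-8.
    auto t = Token(to!string(src[0 .. 3]));
    assert(t.str == "foo");

    // Templated token: slicing keeps the original dstring type.
    auto tr = TokenR!dstring(src[0 .. 3]);
    static assert(is(typeof(tr.str) == dstring));
    assert(tr.str == "foo"d);
}
```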
It could go either way. I used string on first pass, but as I said, I could
change it to R later if that makes more sense. I'm not particularly hung up on
that little detail at this point, and that's probably one of the things that
can be changed reasonably easily later.
- Jonathan M Davis
More information about the Digitalmars-d mailing list