Let's stop parser Hell

Philippe Sigaud philippe.sigaud at gmail.com
Wed Aug 1 05:44:29 PDT 2012


On Wed, Aug 1, 2012 at 8:39 AM, Jonathan M Davis <jmdavisProg at gmx.com> wrote:

> It was never intended to be even vaguely generic. It's targeting D
> specifically. If someone can take it and make it generic when I'm done, then
> great. But its goal is to lex D as efficiently as possible, and it'll do
> whatever it takes to do that.

That's exactly what I had in mind. Anyway, we need a D lexer. We also
need a generic lexer generator, but as a distant second priority, and
we can accept it being less efficient. Of course, any trick used in the
D lexer can most probably be reused for other Algol-family lexers.


>> I don't get it. Say I have a literal with non-UTF-8 chars, how will
>> it be stored inside the .str field as a string?
>
> The literal is written in whatever encoding the range is in. If it's UTF-8,
> it's UTF-8. If it's UTF-32, it's UTF-32. UTF-8 can hold exactly the same set
> of characters that UTF-32 can. Your range could be UTF-32, but the string
> literal is supposed to be UTF-8 ultimately. Or the range could be UTF-8 when
> the literal is UTF-32. The characters themselves are in the encoding type of
> the range regardless. It's just the values that the compiler generates which
> change.
>
> "hello world"
> "hello world"c
> "hello world"w
> "hello world"d
>
> are absolutely identical as far as lexing goes save for the trailing
> character. It would be the same regardless of the characters in the strings or
> the encoding used in the source file.
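
To make the claim above concrete — that lexing is identical up to the
trailing suffix character — here is a minimal sketch (Python for brevity;
the `lex_string` helper is a made-up illustration, not the lexer under
discussion, and it ignores escapes entirely):

```python
def lex_string(src: str) -> tuple[str, str]:
    """Split a quoted literal into (body, suffix). Illustrative only:
    no escape handling, no error recovery."""
    assert src[0] == '"'
    end = src.index('"', 1)       # position of the closing quote
    body = src[1:end]             # the literal's characters, unchanged
    suffix = src[end + 1:]        # "", "c", "w" or "d"
    return body, suffix

# All four literals lex to the same body; only the suffix differs.
for lit in ['"hello world"', '"hello world"c', '"hello world"w', '"hello world"d']:
    body, suffix = lex_string(lit)
    assert body == "hello world"
```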

Every time I think I understand D strings, you prove me wrong. So, I
*still* don't get how that works:

say I have

auto s = " - some greek or chinese chars, mathematical symbols, whatever - "d;

Then, the "..." part is lexed as a string literal. How can the string
field in the Token magically contain UTF-32 characters? Or are they
automatically cut into four nonsense chars each? What about comments
containing non-ASCII chars? How can code coming after the lexer make
sense of them?
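
For reference, the underlying Unicode fact here is that UTF-8 and UTF-32
encode exactly the same set of code points, so a literal held in UTF-8
is not "cut into nonsense chars" — decoding recovers the original
characters intact. A minimal sketch (Python, purely illustrative):

```python
s = "αβγ ∑ 漢"                  # Greek letters, a math symbol, a CJK char
utf8 = s.encode("utf-8")         # variable-length multi-byte sequences
utf32 = s.encode("utf-32-le")    # fixed 4 bytes per code point

# Decoding either encoding recovers the identical sequence of code points.
assert utf8.decode("utf-8") == utf32.decode("utf-32-le") == s
assert len(utf32) == 4 * len(s)  # 7 code points -> 28 bytes in UTF-32
```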

As Jacob says, many people code in English. That's right, but

1- they most probably use their own language for internal documentation
2- any i18n part of a code base will have non-ASCII chars
3- D is supposed to accept UTF-16 and UTF-32 source code.

So, wouldn't it make sense to at least provide an option on the lexer
to specifically store identifier lexemes and comments as a dstring?
