Let's stop parser Hell

Jacob Carlborg doob at me.com
Wed Aug 1 07:40:46 PDT 2012


On 2012-08-01 14:44, Philippe Sigaud wrote:

> Every time I think I understand D strings, you prove me wrong. So, I
> *still* don't get how that works:
>
> say I have
>
> auto s = " - some Greek or Chinese chars, mathematical symbols, whatever - "d;
>
> Then, the "..." part is lexed as a string literal. How can the string
> field in the Token magically contain UTF-32 characters? Or are they
> automatically cut into four nonsense chars each? What about comments
> containing non-ASCII chars? How can code coming after the lexer make
> sense of it?
>
> As Jacob says, many people code in English. That's right, but:
>
> 1- they most probably use their own language for internal documentation
> 2- any i18n part of a code base will have non-ASCII chars
> 3- D is supposed to accept UTF-16 and UTF-32 source code.
>
> So, wouldn't it make sense to at least provide an option on the lexer
> to specifically store identifier lexemes and comments as a dstring?

I'm not quite sure how it works either, but I'm thinking along these lines:

The string representing what's in the source code can be either UTF-8 or 
whatever encoding the file uses. I'm not sure whether the lexer needs to 
re-encode the string when it's not in the same encoding as the file.
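To make that concrete, here's a minimal sketch of what I have in mind 
(Token, TokenType and normalizeSource are names I made up for the 
example, not the actual lexer API): the source is re-encoded to UTF-8 
once, and every token just slices that buffer. Since UTF-8 can encode 
any Unicode code point, Greek or Chinese characters in a literal simply 
become multi-byte sequences in the slice; nothing gets cut into 
nonsense chars.

import std.utf : toUTF8;

enum TokenType { identifier, stringLiteral, comment /* ... */ }

struct Token
{
    TokenType type;
    string lexeme; // raw UTF-8 slice of the normalized source
}

// If the file on disk is UTF-16 (or UTF-32), re-encode it once up
// front so that every later slice is plain UTF-8.
string normalizeSource(const(wchar)[] utf16Source)
{
    return toUTF8(utf16Source);
}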

Then there's another field/function that returns the processed value of 
the token, e.g. for an integer literal it will return an actual int. For 
string literals, this function will return a different string type 
(string, wstring or dstring) depending on the type of literal the token 
represents.
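Continuing the sketch above (again, literalValue and the Token layout 
are just assumptions), such an accessor could strip the quotes and the 
optional c/w/d suffix from the UTF-8 lexeme and re-encode on demand:

import std.utf : toUTF8, toUTF16, toUTF32;

struct Token { string lexeme; } // as above: a raw UTF-8 slice

// Decode a string-literal token into the width its suffix asks for.
S literalValue(S)(Token tok)
    if (is(S == string) || is(S == wstring) || is(S == dstring))
{
    auto s = tok.lexeme;

    // Drop an optional c/w/d suffix (a simplification; real D
    // literals come in more forms than plain double-quoted ones).
    if (s.length && (s[$ - 1] == 'c' || s[$ - 1] == 'w' || s[$ - 1] == 'd'))
        s = s[0 .. $ - 1];

    auto inner = s[1 .. $ - 1]; // strip the surrounding quotes

    static if (is(S == string))
        return toUTF8(inner);
    else static if (is(S == wstring))
        return toUTF16(inner);
    else
        return toUTF32(inner);
}

So for Philippe's example, literalValue!dstring(tok) would hand back 
the UTF-32 view, while the token itself keeps storing plain UTF-8.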

-- 
/Jacob Carlborg

