Let's stop parser Hell

Jacob Carlborg doob at me.com
Wed Aug 1 11:29:45 PDT 2012


On 2012-08-01 20:24, Jonathan M Davis wrote:

> D source text can be in one of the fol­low­ing for­mats:
> * ASCII
> * UTF-8
> * UTF-16BE
> * UTF-16LE
> * UTF-32BE
> * UTF-32LE
>
> So, yes, you can stick unicode characters directly in D code. Though I wonder
> about the correctness of the spec here. It claims that if there's no BOM, then
> it's ASCII, but unless vim inserts BOM markers into all of my .d files, none of
> them have BOM markers, but I can put Unicode in a .d file just fine with vim. I
> should probably study up on BOMs.
>
> In any case, the source is read in whatever encoding it's in. String literals
> then all become UTF-8 in the final object code unless they're marked as
> specifically being another type via the postfix letter or they're inferred as
> being another type (e.g. when you assign a string literal to a dstring).
> Regardless, what's in the final object code is based on the types that the type
> system marks strings as, not what the encoding of the source code was.
>
> So, a lexer shouldn't care about what the encoding of the source is beyond
> what it takes to convert it to a format that it can deal with and potentially
> being written in a way which makes handling a particular encoding more
> efficient. The values of literals and the like are completely unaffected
> regardless.

But if you read a source file which is encoded using UTF-16, you would 
need to re-encode it to store it in the "str" field of your Token struct?

If that's the case, wouldn't it be better to make Token a template, so it 
can store all the Unicode encodings without re-encoding? Although I 
don't know if that would complicate the rest of the lexer.
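To illustrate, a minimal sketch of what I mean (the TokenType enum and 
field names here are made up for illustration, not taken from any actual 
lexer) — "str" just slices the source buffer in whatever encoding it was 
read in:

```d
// Hypothetical sketch: a Token parameterized on the source's character
// type, so e.g. a UTF-16 file can be lexed without re-encoding slices.
enum TokenType { identifier, stringLiteral, number, eof }

struct Token(Char)
    if (is(Char == char) || is(Char == wchar) || is(Char == dchar))
{
    TokenType type;
    immutable(Char)[] str; // slice of the source buffer, no re-encoding
    size_t line;
}

alias Token!char  Token8;   // UTF-8 (and ASCII) source
alias Token!wchar Token16;  // UTF-16 source
alias Token!dchar Token32;  // UTF-32 source
```

The downside is that everything downstream of the lexer then has to be 
templated on Char as well, or handle three instantiations, which is the 
complication I'm worried about.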

-- 
/Jacob Carlborg


More information about the Digitalmars-d mailing list