Let's stop parser Hell

Wed Aug 1 11:24:11 PDT 2012

On Wednesday, August 01, 2012 19:50:10 Philippe Sigaud wrote:
> On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg at gmx.com> wrote:
> > "ウェブサイト"
> > "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"
> > 
> > The encoding of the source file is irrelevant.
> 
> do you mean I can do:
> 
> string field = "ウェブサイト";
> 
> ?
> 
> Geez, just tested it, it works. Even writeln(field) correctly outputs
> the Japanese chars, and dmd doesn't choke on it.
> Bang, back to state 0: I don't get how D strings work.

From http://dlang.org/lex.html:

D source text can be in one of the following formats: 
* ASCII
* UTF-8
* UTF-16BE
* UTF-16LE
* UTF-32BE
* UTF-32LE

So, yes, you can put Unicode characters directly in D code. Though I wonder 
about the correctness of the spec here. It claims that if there's no BOM, then 
it's ASCII, but unless vim inserts BOMs into all of my .d files, none of them 
have one, and I can put Unicode in a .d file just fine with vim. I should 
probably study up on BOMs.

In any case, the source is read in whatever encoding it's in. String literals 
then all become UTF-8 in the final object code unless they're explicitly 
marked as another type via a postfix letter (c, w, or d) or they're inferred 
as another type (e.g. when you assign a string literal to a dstring). 
Regardless, what's in the final object code is based on the types that the type 
system gives the strings, not on what the encoding of the source code was.
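
To make that concrete, here is a small sketch (my own illustration, not 
something from the spec) showing that the raw characters and the \u escapes 
produce the same value, and that the postfix letter or the target type decides 
the code unit type. The variable names are arbitrary.

```d
import std.stdio : writeln;

void main()
{
    // The same text written two ways: raw characters and \u escapes.
    string literal = "ウェブサイト";
    string escaped = "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8";
    assert(literal == escaped);

    // Without a postfix, a literal is a string (UTF-8 code units).
    static assert(is(typeof("ウェブサイト") == string));

    // The postfix letters w and d select UTF-16 and UTF-32 explicitly.
    static assert(is(typeof("ウェブサイト"w) == wstring));
    static assert(is(typeof("ウェブサイト"d) == dstring));

    // Inference: assigning an unmarked literal to a dstring makes it UTF-32.
    dstring inferred = "ウェブサイト";
    wstring wide = "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"w;

    // .length counts code units, not characters. Each of these six
    // characters takes 3 bytes in UTF-8 and one unit in UTF-16 and UTF-32.
    assert(escaped.length == 18);
    assert(wide.length == 6);
    assert(inferred.length == 6);

    writeln(literal);
}
```

None of this depends on how the .d file itself was saved, as long as the 
compiler can read it.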

So, a lexer shouldn't care about the encoding of the source beyond what it 
takes to convert it to a format that it can deal with, and possibly being 
written in a way that makes handling a particular encoding more efficient. 
The values of literals and the like are unaffected by the source encoding.
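
As a rough sketch of what "convert it to a format that it can deal with" 
could look like, the code below sniffs a BOM and normalizes the raw bytes to 
UTF-8 before any lexing happens. This is only an illustration under my own 
assumptions, not how dmd does it: SourceEncoding, hasPrefix, detectEncoding, 
decodeUnits and toUTF8Source are all made-up names, and a file without a BOM 
is simply treated as UTF-8 (which also covers ASCII).

```d
import std.utf : toUTF8, validate;

// Hypothetical names throughout; this is not dmd's implementation.
enum SourceEncoding { utf8, utf16be, utf16le, utf32be, utf32le }

bool hasPrefix(const(ubyte)[] src, const(ubyte)[] prefix)
{
    return src.length >= prefix.length && src[0 .. prefix.length] == prefix;
}

SourceEncoding detectEncoding(const(ubyte)[] src)
{
    // UTF-32LE must be checked before UTF-16LE: FF FE 00 00 starts with FF FE.
    if (hasPrefix(src, [0xFF, 0xFE, 0x00, 0x00])) return SourceEncoding.utf32le;
    if (hasPrefix(src, [0x00, 0x00, 0xFE, 0xFF])) return SourceEncoding.utf32be;
    if (hasPrefix(src, [0xFF, 0xFE])) return SourceEncoding.utf16le;
    if (hasPrefix(src, [0xFE, 0xFF])) return SourceEncoding.utf16be;
    // Assumption: no BOM (or a UTF-8 BOM) means UTF-8.
    return SourceEncoding.utf8;
}

// Assemble code units of type C from raw bytes in the given byte order.
// Trailing bytes that do not fill a whole code unit are ignored.
immutable(C)[] decodeUnits(C)(const(ubyte)[] bytes, bool bigEndian)
{
    auto units = new C[bytes.length / C.sizeof];
    foreach (i, ref unit; units)
    {
        uint value = 0;
        foreach (j; 0 .. C.sizeof)
        {
            immutable b = bytes[i * C.sizeof + j];
            value = bigEndian ? (value << 8) | b : value | (uint(b) << (8 * j));
        }
        unit = cast(C) value;
    }
    return units.idup;
}

// Normalize any supported source encoding to UTF-8, dropping the BOM.
string toUTF8Source(const(ubyte)[] raw)
{
    final switch (detectEncoding(raw))
    {
        case SourceEncoding.utf8:
        {
            auto bytes = hasPrefix(raw, [0xEF, 0xBB, 0xBF]) ? raw[3 .. $] : raw;
            auto text = cast(string) bytes.idup;
            validate(text); // throws UTFException on malformed UTF-8
            return text;
        }
        case SourceEncoding.utf16be:
            return toUTF8(decodeUnits!wchar(raw[2 .. $], true));
        case SourceEncoding.utf16le:
            return toUTF8(decodeUnits!wchar(raw[2 .. $], false));
        case SourceEncoding.utf32be:
            return toUTF8(decodeUnits!dchar(raw[4 .. $], true));
        case SourceEncoding.utf32le:
            return toUTF8(decodeUnits!dchar(raw[4 .. $], false));
    }
}

void main()
{
    // U+30A6 in four encodings; every one normalizes to the same UTF-8 text.
    assert(toUTF8Source([0xE3, 0x82, 0xA6]) == "\u30A6");
    assert(toUTF8Source([0xFE, 0xFF, 0x30, 0xA6]) == "\u30A6");
    assert(toUTF8Source([0xFF, 0xFE, 0xA6, 0x30]) == "\u30A6");
    assert(toUTF8Source([0x00, 0x00, 0xFE, 0xFF, 0x00, 0x00, 0x30, 0xA6]) == "\u30A6");
}
```

After a step like this, the rest of the lexer only ever sees one encoding, 
which is the point: the literal's value is the same no matter which of the 
formats the file was saved in.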

- Jonathan M Davis