Let's stop parser Hell
Jonathan M Davis
jmdavisProg at gmx.com
Wed Aug 1 11:24:11 PDT 2012
On Wednesday, August 01, 2012 19:50:10 Philippe Sigaud wrote:
> On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg at gmx.com> wrote:
> > "ウェブサイト"
> > "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"
> >
> > The encoding of the source file is irrelevant.
>
> do you mean I can do:
>
> string field = "ウェブサイト";
>
> ?
>
> Geez, just tested it, it works. even writeln(field) correctly output
> the japanese chars and dmd doesn't choke on it.
> Bang, back to state 0: I don't get how D strings work.
From http://dlang.org/lex.html:
D source text can be in one of the following formats:
* ASCII
* UTF-8
* UTF-16BE
* UTF-16LE
* UTF-32BE
* UTF-32LE
So, yes, you can stick Unicode characters directly in D code. Though I wonder
about the correctness of the spec here. It claims that if there's no BOM, then
the source is ASCII, but unless vim inserts BOMs into all of my .d files, none
of them have one, and I can still put Unicode in a .d file just fine with vim.
I should probably study up on BOMs.
In any case, the source is read in whatever encoding it's in. String literals
then all become UTF-8 in the final object code unless they're explicitly marked
as another type via a postfix letter (c, w, or d) or they're inferred as
another type (e.g. when you assign a string literal to a dstring).
Regardless, what's in the final object code is based on the types that the type
system marks strings as, not what the encoding of the source code was.
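
A minimal sketch of the point above, assuming a reasonably recent dmd. The
variable names are only for illustration. It shows that the type of a literal
(and so its encoding in the object code) comes from the postfix letter or from
inference, and that the escaped and unescaped spellings give the same value:

```d
import std.stdio;

void main()
{
    // No postfix: the literal is a string (immutable(char)[]), i.e. UTF-8.
    string s = "ウェブサイト";

    // A postfix letter selects the type explicitly.
    auto c = "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"c; // string,  UTF-8
    auto w = "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"w; // wstring, UTF-16
    auto d = "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"d; // dstring, UTF-32

    // Without a postfix, the type can be inferred from the target.
    dstring inferred = "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8";

    static assert(is(typeof(c) == string));
    static assert(is(typeof(w) == wstring));
    static assert(is(typeof(d) == dstring));

    // Same value whether the characters are typed directly or escaped.
    assert(s == c);
    assert(inferred == d);

    // .length counts code units, so it differs by type:
    // six code points, each three bytes long in UTF-8.
    assert(c.length == 18);
    assert(w.length == 6);
    assert(d.length == 6);

    writeln(s);
}
```

None of the asserts depend on how the .d file itself is encoded, as long as
it's one of the encodings that the compiler accepts.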
So, a lexer shouldn't care about the encoding of the source beyond what it
takes to convert it to a format that it can deal with, though it could
potentially be written in a way which makes handling a particular encoding
more efficient. The values of literals and the like are completely unaffected
regardless.
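
As a rough sketch of the "convert it to a format that it can deal with" step,
a lexer could start by looking for a BOM. All of the names here
(SourceEncoding, Detected, detectEncoding, hasPrefix) are hypothetical and not
taken from dmd. Treating a file with no BOM as UTF-8 is an assumption that
matches what I see with my own files rather than what the spec text above
says. ASCII is a subset of UTF-8, so it covers that case too.

```d
import std.stdio;

// Hypothetical names for illustration; none of this is taken from dmd.
enum SourceEncoding { utf8, utf16BE, utf16LE, utf32BE, utf32LE }

struct Detected
{
    SourceEncoding encoding;
    size_t bomLength; // number of leading bytes to skip
}

// Byte order marks for each encoding.
immutable ubyte[] bomUtf8    = [0xEF, 0xBB, 0xBF];
immutable ubyte[] bomUtf16BE = [0xFE, 0xFF];
immutable ubyte[] bomUtf16LE = [0xFF, 0xFE];
immutable ubyte[] bomUtf32BE = [0x00, 0x00, 0xFE, 0xFF];
immutable ubyte[] bomUtf32LE = [0xFF, 0xFE, 0x00, 0x00];

bool hasPrefix(const(ubyte)[] src, const(ubyte)[] prefix)
{
    return src.length >= prefix.length && src[0 .. prefix.length] == prefix;
}

Detected detectEncoding(const(ubyte)[] src)
{
    // UTF-32LE is checked before UTF-16LE because its BOM (FF FE 00 00)
    // starts with the UTF-16LE BOM (FF FE).
    if (hasPrefix(src, bomUtf32BE)) return Detected(SourceEncoding.utf32BE, 4);
    if (hasPrefix(src, bomUtf32LE)) return Detected(SourceEncoding.utf32LE, 4);
    if (hasPrefix(src, bomUtf16BE)) return Detected(SourceEncoding.utf16BE, 2);
    if (hasPrefix(src, bomUtf16LE)) return Detected(SourceEncoding.utf16LE, 2);
    if (hasPrefix(src, bomUtf8))    return Detected(SourceEncoding.utf8, 3);

    // Assumption: no BOM means UTF-8 (which also covers plain ASCII).
    return Detected(SourceEncoding.utf8, 0);
}

void main()
{
    immutable ubyte[] withBom    = [0xEF, 0xBB, 0xBF, 0x61];
    immutable ubyte[] withoutBom = [0x61, 0x62, 0x63];

    writeln(detectEncoding(withBom));
    writeln(detectEncoding(withoutBom));
}
```

After that, the lexer would skip bomLength bytes and decode the rest into
whatever it works on internally. Nothing past that point needs to know what
the file's encoding was.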
- Jonathan M Davis