Inconsistency

Peter Alexander peter.alexander.au at gmail.com
Sun Oct 13 11:10:39 PDT 2013


On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
>> However, it could also yield the first code unit of the umlaut 
>> diacritic, depending on how the string is represented.
>
> This is not true for UTF-8, which is not subject to "endianism".

You are correct that UTF-8 is endian-agnostic, but I don't
believe that was Sönke's point. The point is that ä can be
represented in Unicode in more than one way. This program
illustrates:

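import std.stdio;

void main()
{
    // Escapes make the two forms explicit regardless of how the
    // source file itself was saved.
    string a = "\u00E4";  // precomposed 'ä'
    string b = "a\u0308"; // 'a' followed by Combining Diaeresis
    writeln(a);
    writeln(b);
    // View the raw UTF-8 code units without casting away immutability.
    writeln(cast(immutable(ubyte)[]) a);
    writeln(cast(immutable(ubyte)[]) b);
}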

This prints:

ä
ä
[195, 164]
[97, 204, 136]

Notice that they are both the same "character" but have different
representations. The first is just the 'ä' code point (U+00E4),
which is encoded as two UTF-8 code units. The second is the 'a'
code point (U+0061) followed by a Combining Diaeresis code point
(U+0308), which together take three code units.
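
To see the code points themselves rather than the raw bytes, something
like the following sketch should work. The helper name codePointsOf is
made up for illustration; it relies only on the fact that foreach with
a dchar loop variable decodes a UTF-8 string into code points.

```d
import std.stdio : writefln, writeln;

// Hypothetical helper: collect the code points of a UTF-8 string.
dchar[] codePointsOf(string s)
{
    dchar[] result;
    foreach (dchar c; s) // a dchar loop variable decodes UTF-8 on the fly
        result ~= c;
    return result;
}

void main()
{
    // Precomposed form first, then the decomposed form.
    foreach (s; ["\u00E4", "a\u0308"])
    {
        foreach (c; codePointsOf(s))
            writefln("U+%04X", cast(uint) c);
        writeln("--");
    }
}
```

The first string yields a single code point and the second yields two,
even though both render as the same character.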

In short, the string "ä" could be either 2 or 3 UTF-8 code units,
and either 1 or 2 code points.
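
The counts above can be checked with a short sketch. This assumes a
Phobos recent enough to provide std.uni.byGrapheme and
std.uni.normalize, and it assumes walkLength decodes narrow strings
into code points (the current auto-decoding behaviour). The four
helper names are invented for this example.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme, normalize, NFC;

// Hypothetical helpers, named for illustration only.
size_t countCodeUnits(string s)  { return s.length; }     // UTF-8 code units
size_t countCodePoints(string s) { return s.walkLength; } // decoded code points
size_t countGraphemes(string s)  { return s.byGrapheme.walkLength; }

// Compare two strings after converting both to the composed form (NFC).
bool sameAfterNFC(string x, string y)
{
    return normalize!NFC(x) == normalize!NFC(y);
}

void main()
{
    string a = "\u00E4";  // precomposed
    string b = "a\u0308"; // decomposed

    writeln(countCodeUnits(a), " ", countCodeUnits(b));
    writeln(countCodePoints(a), " ", countCodePoints(b));
    writeln(countGraphemes(a), " ", countGraphemes(b));
    writeln(a == b);            // plain comparison looks at code units
    writeln(sameAfterNFC(a, b));
}
```

A plain == sees two different strings because it compares code units.
Counting graphemes or normalizing first is one way to treat the two
forms as the same character, though whether that is the right choice
depends on what the program is doing with the text.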

