The Case Against Autodecode

Tue May 31 11:30:08 PDT 2016

On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
> On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
>> On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
>>> Saying that operating at the code point level - UTF-32 - is correct
>>> is like saying that operating at UTF-16 instead of UTF-8 is correct.
>>
>> Could you please substantiate that? My understanding is that code point
>> is a higher-level Unicode notion independent of encoding, whereas code
>> unit is an encoding-dependent representation detail. -- Andrei
>
> Okay. If you have the letter A, it will fit in one UTF-8 code unit, one
> UTF-16 code unit, and one UTF-32 code unit (so, one code point).
>
> assert("A"c.length == 1);
> assert("A"w.length == 1);
> assert("A"d.length == 1);
>
> If you have 月, then you get
>
> assert("月"c.length == 3);
> assert("月"w.length == 1);
> assert("月"d.length == 1);
>
> whereas if you have 𐀆, then you get
>
> assert("𐀆"c.length == 4);
> assert("𐀆"w.length == 2);
> assert("𐀆"d.length == 1);
>
> So, with these characters, it's clear that UTF-8 and UTF-16 don't cut it for
> holding an entire character, but it still looks like UTF-32 does.

Does walkLength yield the same number for all representations?
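
To make the question concrete, here is a minimal sketch using the same characters, written with escapes so it does not depend on the source file's encoding. countCodePoints is a hypothetical helper of mine, not a Phobos function; it just forwards to walkLength, which iterates strings by decoded dchar rather than by code unit.

```d
import std.range : walkLength;

// Hypothetical helper, not part of Phobos: counts the elements that range
// algorithms see. For string, wstring and dstring those elements are
// decoded code points (dchar), not code units.
size_t countCodePoints(S)(S s)
{
    return s.walkLength;
}

unittest
{
    // U+6708: .length is 3 / 1 / 1 code units, but one code point everywhere.
    assert("\u6708"c.length == 3);
    assert(countCodePoints("\u6708"c) == 1);
    assert(countCodePoints("\u6708"w) == 1);
    assert(countCodePoints("\u6708"d) == 1);

    // U+10006: .length is 4 / 2 / 1 code units, still one code point everywhere.
    assert("\U00010006"w.length == 2);
    assert(countCodePoints("\U00010006"c) == 1);
    assert(countCodePoints("\U00010006"w) == 1);
    assert(countCodePoints("\U00010006"d) == 1);
}
```

So .length varies with the encoding, while the walkLength count does not.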

> However,
> what about characters like é or שׂ? Notice that שׂ takes up more than one code
> point.
>
> assert("שׂ"c.length == 4);
> assert("שׂ"w.length == 2);
> assert("שׂ"d.length == 2);
>
> It's ש with some sort of dot marker on it that they have in Hebrew, but it's
> a single character in spite of the fact that it's multiple code points. é is
> in a similar, though more complicated boat. With D, you'll get
>
> assert("é"c.length == 2);
> assert("é"w.length == 1);
> assert("é"d.length == 1);
>
> because the compiler decides to use the version of é that's a single code
> point.

Does walkLength yield the same number for all representations?

> However, Unicode is set up so that that accent can be its own code
> point and be applied to any other code point - be it an e, an a, or even
> something like the number 0. If we normalize é, we can see other
> versions of it that take up more than one code point. e.g.
>
> assert("é"d.normalize!NFC.length == 1);
> assert("é"d.normalize!NFD.length == 2);
> assert("é"d.normalize!NFKC.length == 1);
> assert("é"d.normalize!NFKD.length == 2);

Does walkLength yield the same number for all representations?

> And you can even put that accent on 0 by doing something like
>
> assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d);
>
> One or more code units combine to make a single code point, but one or more
> code points also combine to make a grapheme.

That's right. D's handling of UTF is at the code point level (the level
at which all of Unicode is portably defined). If you want graphemes, use
byGrapheme.
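
A minimal sketch of the difference, assuming only std.uni.byGrapheme and std.range.walkLength; countGraphemes is a hypothetical helper name, not something in Phobos:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper, not part of Phobos: counts grapheme clusters,
// i.e. what a reader would call characters.
size_t countGraphemes(S)(S s)
{
    return s.byGrapheme.walkLength;
}

unittest
{
    string precomposed = "\u00E9";  // single code point
    string decomposed  = "e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT

    // Default iteration counts code points, so the two spellings differ.
    assert(precomposed.walkLength == 1);
    assert(decomposed.walkLength == 2);

    // Grapheme iteration treats both spellings as one character.
    assert(countGraphemes(precomposed) == 1);
    assert(countGraphemes(decomposed) == 1);
}
```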

It seems you destroyed your own argument, which was:

> Saying that operating at the code point level - UTF-32 - is correct
> is like saying that operating at UTF-16 instead of UTF-8 is correct.

You can't claim code units are just a special case of code points.

Andrei