The Case Against Autodecode

Jonathan M Davis via Digitalmars-d digitalmars-d at puremagic.com
Tue May 31 11:11:47 PDT 2016


On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
> On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
> > Saying that operating at the code point level - UTF-32 - is correct
> > is like saying that operating at UTF-16 instead of UTF-8 is correct.
>
> Could you please substantiate that? My understanding is that code unit
> is a higher-level Unicode notion independent of encoding, whereas code
> point is an encoding-dependent representation detail. -- Andrei

Okay. If you have the letter A, it will fit in one UTF-8 code unit, one
UTF-16 code unit, and one UTF-32 code unit (so, one code point).

assert("A"c.length == 1);
assert("A"w.length == 1);
assert("A"d.length == 1);

If you have 月, then you get

assert("月"c.length == 3);
assert("月"w.length == 1);
assert("月"d.length == 1);

whereas if you have 𐀆, then you get

assert("𐀆"c.length == 4);
assert("𐀆"w.length == 2);
assert("𐀆"d.length == 1);

So, with these characters, it's clear that a single UTF-8 or UTF-16 code unit
doesn't cut it for holding an entire character, but it still looks like a single
UTF-32 code unit - a code point - does. However, what about characters like é
or שׂ? Notice that שׂ takes up more than one code point.

assert("שׂ"c.length == 4);
assert("שׂ"w.length == 2);
assert("שׂ"d.length == 2);

It's ש with a dot diacritic used in Hebrew, but it's a single character in
spite of the fact that it's multiple code points. é is in a similar, though
more complicated, boat. With D, you'll get

assert("é"c.length == 2);
assert("é"w.length == 1);
assert("é"d.length == 1);

because the compiler uses the precomposed version of é, which is a single code
point. However, Unicode also defines that accent as its own combining code
point, which can be applied to any other code point - be it an e, an a, or even
something like the number 0. If we normalize é (using normalize from std.uni),
we can see other versions of it that take up more than one code point. e.g.

assert("é"d.normalize!NFC.length == 1);
assert("é"d.normalize!NFD.length == 2);
assert("é"d.normalize!NFKC.length == 1);
assert("é"d.normalize!NFKD.length == 2);

And you can even put that accent on 0 by doing something like

assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d);

One or more code units combine to make a single code point, and one or more
code points combine to make a grapheme. So, while there is a definite layer of
separation between code units and code points, it's still the case that a
single code point is not guaranteed to be a single character. It's true that
it's the code units, not the code points, that depend on the encoding (though
code points still have different normalizations, which is kind of like having
different encodings), but in terms of correctness, treating code points as
characters has the same problem as treating code units as characters. You're
still not guaranteed that you're operating on full characters, and you risk
chopping them up. It's just that at the code point level, you're generally
chopping off something that is visually separable (like an accent from a
letter or a superscript on a symbol), whereas with code units, you end up with
utter garbage when you chop them incorrectly.
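
To see that grapheme layer in code, here's a minimal sketch using byGrapheme
from std.uni and walkLength from std.range (illustrative, not a full program):

import std.range : walkLength;
import std.uni;

// The NFD form of é is two code points: 'e' plus the combining acute accent.
auto s = "é"d.normalize!NFD;
assert(s.length == 2);                // 2 code points
assert(s.byGrapheme.walkLength == 1); // but only 1 grapheme - 1 "character"

// Slicing by code point happily splits that grapheme apart:
assert(s[0 .. 1] == "e"d);            // the accent is silently dropped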

By operating at the code point level, we're correct for _way_ more characters
than we would be if we treated a char as a full character, but we're still not
fully correct, and it's a lot harder to notice when you screw it up, because
the number of characters which are handled incorrectly is far smaller.
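
For example, here's a sketch of a naive reversal done at the code point level
(retro and array are from std.range and std.array). It works fine on ASCII and
on precomposed text, which is exactly why the breakage is easy to miss:

import std.array : array;
import std.range : retro;
import std.uni;

auto s = "né"d.normalize!NFD;   // ['n', 'e', U+0301 combining acute]
auto r = s.retro.array;         // reverse code point by code point
// The accent ends up orphaned at the front instead of sitting on the 'e':
assert(r == "\u0301en"d);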

- Jonathan M Davis



