The Case Against Autodecode

Jonathan M Davis via Digitalmars-d digitalmars-d at puremagic.com
Tue May 31 11:53:14 PDT 2016


On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d wrote:
> On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
> > On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
> >> On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
> >>> Saying that operating at the code point level - UTF-32 - is correct
> >>> is like saying that operating at UTF-16 instead of UTF-8 is correct.
> >>
> >> Could you please substantiate that? My understanding is that code point
> >> is a higher-level Unicode notion independent of encoding, whereas code
> >> unit is an encoding-dependent representation detail. -- Andrei
> >
> Does walkLength yield the same number for all representations?

walkLength treats a code point like it's a character. My point is that
that's incorrect behavior. It will not result in correct string processing
in the general case, because a code point is not guaranteed to be a
full character.

walkLength does not always report the length of a character as one, just as
length does not. walkLength counts bigger units than length does, but it's
still counting pieces of a character rather than counting full characters.
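
To make that concrete, here's a minimal sketch (the string is my own example,
not something from this thread) of what the standard Phobos primitives report
for a character built from a base code point plus a combining accent:

    import std.range : walkLength;

    void main()
    {
        // '0' followed by U+0301 COMBINING ACUTE ACCENT - one character on screen
        string s = "0\u0301";

        assert(s.length == 3);     // length counts UTF-8 code units
        assert(s.walkLength == 2); // walkLength counts decoded code points
        // Neither answer is 1, because neither is counting full characters.
    }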

> > And you can even put that accent on 0 by doing something like
> >
> > assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d);
> >
> > One or more code units combine to make a single code point, but one or more
> > code points also combine to make a grapheme.
>
> That's right. D's handling of UTF is at the code point level (like all of
> Unicode is portably defined). If you want graphemes use byGrapheme.
>
> It seems you destroyed your own argument, which was:
> > Saying that operating at the code point level - UTF-32 - is correct
> > is like saying that operating at UTF-16 instead of UTF-8 is correct.
>
> You can't claim code units are just a special case of code points.

The point is that treating a code point like it's a full character is just
as wrong as treating a code unit as if it were a full character. It's _not_
guaranteed to be a full character. Treating code points as full characters
gives you the correct result in more cases than treating code units as full
characters does, but it still gives you the wrong result in many cases. If
we want fully correct behavior
without making the programmer deal with all of the Unicode issues
themselves, then we need to operate at the grapheme level so that we are
operating on full characters (though that obviously comes at a high cost to
efficiency).
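
For instance (a sketch of my own, not code from this thread, using the
existing std.uni primitives byGrapheme and byCodePoint), reversing a string
at the code point level tears a combining accent off its base character,
whereas doing it at the grapheme level keeps each character intact:

    import std.algorithm.comparison : equal;
    import std.array : array;
    import std.range : retro;
    import std.uni : byCodePoint, byGrapheme;

    void main()
    {
        // "née" with the accent written as 'e' + U+0301 COMBINING ACUTE ACCENT
        string s = "ne\u0301e";

        // Code point level: the accent ends up on the wrong 'e' ("éen").
        assert(s.retro.equal("e\u0301en"d));

        // Grapheme level: the accent stays with its base character ("eén").
        assert(s.byGrapheme.array.retro.byCodePoint.equal("ee\u0301n"d));
    }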

Treating code points as characters, as we do right now, does not give the
correct result in the general case, just as treating code units as characters
doesn't. Both work some of the time, but neither works all of the time.

Autodecoding attempts to hide the fact that it's operating on Unicode but
does not actually go far enough to result in correct behavior. So, we pay
the cost of decoding without getting the benefit of correctness.
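
To sketch where that leaves things today (my own example string; byCodeUnit
from std.utf and byGrapheme from std.uni are the explicit opt-outs/opt-ins
that already exist), the autodecoded default sits in the middle - slower than
iterating code units, yet still not counting full characters:

    import std.range : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        string s = "e\u0301"; // a single character: 'e' + combining acute accent

        assert(s.byCodeUnit.walkLength == 3); // no decoding: raw UTF-8 code units
        assert(s.walkLength == 2);            // autodecoded default: code points
        assert(s.byGrapheme.walkLength == 1); // grapheme level: full characters
    }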

- Jonathan M Davis



