The Case Against Autodecode

Tue May 31 12:44:52 PDT 2016

On Tuesday, May 31, 2016 21:20:19 Timon Gehr via Digitalmars-d wrote:
> On 31.05.2016 20:53, Jonathan M Davis via Digitalmars-d wrote:
> > On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d
> > wrote:
> > > On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
> > > > On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via
> > > > Digitalmars-d wrote:
> > > > > On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
> > > > > > Saying that operating at the code point level - UTF-32 - is
> > > > > > correct is like saying that operating at UTF-16 instead of UTF-8
> > > > > > is correct.
> > > > >
> > > > > Could you please substantiate that? My understanding is that code
> > > > > unit is a higher-level Unicode notion independent of encoding,
> > > > > whereas code point is an encoding-dependent representation detail.
> > > > > -- Andrei
> > >
> > > Does walkLength yield the same number for all representations?
> >
> > walkLength treats a code point like it's a character. My point is that
> > that's incorrect behavior. It will not result in correct string processing
> > in the general case, because a code point is not guaranteed to be a
> > full character.
> > ...
>
> What's "correct"? Maybe the user intended to count the number of code
> points in order to pre-allocate a dchar[] of the correct size.
>
> Generally, I don't see how algorithms become magically "incorrect" when
> applied to utf code units.

In the vast majority of cases, what folks care about is full characters,
which is not what code points are. But the fact that they want different
things in different situations highlights that converting everything to
code points by default is a bad idea. Worse, code points are usually the
worst choice. Many operations don't require decoding and can be done at the
code unit level, meaning that operating at the code point level is just
plain inefficient. And the vast majority of the operations that can't
operate at the code point level need to operate on full characters, which
means that they need to operate at the grapheme level. Code points are in
this weird middle ground that's useful in some cases but usually isn't what
you want or need.
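
As a rough illustration of an operation that doesn't need decoding, here is
a substring search done directly on UTF-8 code units. This is a sketch in
Python rather than D, used only as a neutral illustration; the helper name
find_code_units and the sample text are made up for the example.

```python
# Illustrative sketch in Python (not D): search UTF-8 data at the code unit
# level, without decoding either string to code points first.
# find_code_units is a hypothetical helper name, not a library function.

def find_code_units(haystack_utf8: bytes, needle_utf8: bytes) -> int:
    """Return the code unit (byte) offset of the first match, or -1."""
    # bytes.find compares raw code units. Because UTF-8 is self-synchronizing,
    # a well-formed needle can only match on a code point boundary of a
    # well-formed haystack, so no decoding is needed for a correct answer.
    return haystack_utf8.find(needle_utf8)


# "naïve café 日本語", written with escapes so the code points are unambiguous.
text = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e"
haystack = text.encode("utf-8")
needle = "caf\u00e9".encode("utf-8")

offset = find_code_units(haystack, needle)

# The offset is measured in code units, so slicing the encoded data at that
# offset is safe and the prefix decodes cleanly.
prefix = haystack[:offset].decode("utf-8")
print(offset, repr(prefix))
```

The match is found at code unit offset 7, while the same match sits at code
point index 6, because the "ï" before it takes two UTF-8 code units. Nothing
was decoded to get a correct result.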

We need to be able to operate at the code unit level, the code point level,
and the grapheme level. But defaulting to the code point level really makes
no sense.
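
To make the three levels concrete, here is a small sketch that counts the
same text at each of them. It is written in Python purely for illustration;
in D the rough counterparts would be .length, walkLength, and
byGrapheme.walkLength. The function name count_levels is hypothetical, and
the grapheme count is only a crude approximation that skips combining marks.
Real grapheme segmentation follows the Unicode text segmentation rules and
handles many more cases (ZWJ sequences, Hangul jamo, and so on).

```python
import unicodedata


def count_levels(text: str) -> dict:
    """Count one piece of text at the code unit, code point, and
    (approximate) grapheme level. Hypothetical helper for illustration."""
    return {
        # Code units depend on the encoding that was chosen.
        "utf8_code_units": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Python's len() on str counts code points, independent of encoding.
        "code_points": len(text),
        # ASSUMPTION: approximate graphemes by not counting combining marks.
        # This is not a full grapheme cluster implementation.
        "approx_graphemes": sum(
            1 for ch in text if unicodedata.combining(ch) == 0
        ),
    }


# "e" followed by U+0301 COMBINING ACUTE ACCENT: one character to a reader.
accented = count_levels("e\u0301")

# U+1F600, a single code point outside the Basic Multilingual Plane.
emoji = count_levels("\U0001F600")

print(accented)
print(emoji)
```

For the accented "e", that gives 3 UTF-8 code units, 2 UTF-16 code units,
2 code points, and 1 approximate grapheme. The code point count is the same
for every encoding, but it still isn't the number of characters.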

> > walkLength does not report the length of a character as one in all cases
> > just like length does not report the length of a character as one in all
> > cases. walkLength is counting bigger units than length, but it's still
> > counting pieces of a character rather than counting full characters.
>
> The 'length' of a character is not one in all contexts.
> The following text takes six columns in my terminal:
>
> 日本語
> 123456

Well, that's getting into displaying characters, which is a whole other can
of worms, but it also highlights that assuming the programmer wants a
particular level of Unicode is not a particularly good idea and that we
should avoid converting for them without being asked, since it risks being
inefficient to no benefit.
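
For what it's worth, the terminal example above can be roughly reproduced
with a heuristic based on the East Asian Width property. This is a Python
sketch, not D, and approx_columns is a hypothetical helper: actual column
widths depend on the terminal and font, and the heuristic treats ambiguous
width characters as narrow.

```python
import unicodedata


def approx_columns(text: str) -> int:
    """Estimate terminal columns for text. Heuristic only: real terminals
    may disagree, especially for ambiguous-width characters."""
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            # ASSUMPTION: combining marks occupy no column of their own.
            continue
        # Wide ("W") and fullwidth ("F") characters are assumed to take two
        # columns; everything else is assumed to take one.
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        columns += 2 if wide else 1
    return columns


japanese = "\u65e5\u672c\u8a9e"  # the three characters 日本語
print(len(japanese.encode("utf-8")), len(japanese), approx_columns(japanese))
print(approx_columns("123456"))
```

Here the same three characters are 9 UTF-8 code units, 3 code points, and an
estimated 6 columns, which is yet another number that none of the other
levels would have given you.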

- Jonathan M Davis