The Case Against Autodecode

Jonathan M Davis via Digitalmars-d digitalmars-d at puremagic.com
Tue May 31 19:17:21 PDT 2016


On Tuesday, May 31, 2016 23:36:20 Marco Leise via Digitalmars-d wrote:
> Am Tue, 31 May 2016 16:56:43 -0400
>
> schrieb Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org>:
> > On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:
> > > In the vast majority of cases what folks care about is full character
> >
> > How are you so sure? -- Andrei
>
> Because a full character is the typical unit of a written
> language. It's what we visualize in our heads when we think
> about finding a substring or counting characters. A special
> case of this is the reduction to ASCII where we can use code
> units in place of grapheme clusters.

Exactly. How many folks here have written code where the correct thing to do
is to search on code points? Under what circumstances is that even useful?
Code points are a mid-level abstraction between UTF-8/16 and graphemes, and
they are not particularly useful on their own. Yes, by using code points, we
eliminate the differences between the encodings, but how much code even
operates on multiple string types? Having all of your strings use the same
encoding fixes the consistency problem just as well as autodecoding to dchar
everywhere does - and without the efficiency hit. Typically, folks operate
on string or char[] unless they're talking to the Windows API, in which
case, they need wchar[]. Our general recommendation is that D code operate
on UTF-8 except when it has to interact with something that requires a
different encoding (like the Win32 API). In that case, ideally, those
strings are converted to UTF-8 once when they enter the D code and are
processed as UTF-8 from then on. Anything that has to be output in a
different encoding is operated on as UTF-8 until it actually needs to be
output, at which point it's converted to UTF-16 or whatever the target
encoding is. Not much of anyone is recommending that you use dchar[]
everywhere, but that's essentially what the range API is trying to force.
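
As a rough sketch of that boundary pattern (the function and parameter names
here are just for illustration; std.conv.to and std.utf.toUTF16z are the
Phobos calls doing the conversions):

    import std.conv : to;
    import std.utf : toUTF16z;

    void process(wstring fromWinApi)  // hypothetical entry point
    {
        // Convert the incoming UTF-16 to UTF-8 once, at the boundary.
        string text = fromWinApi.to!string;

        // ... all of the actual processing operates on text as UTF-8 ...

        // Convert back to UTF-16 only when handing the result to a
        // W-suffixed Win32 function.
        const(wchar)* forWinApi = text.toUTF16z;
    }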

I think that it's very safe to say that the vast majority of string
processing either is looking to operate on strings as a whole or on
individual, full characters within a string. Code points are neither. Code
may play tricks with Unicode to be efficient (e.g. operating at the code
unit level where it can rather than decoding to code points or graphemes),
or it might assume that its data is ASCII-only, but aside from explicit
Unicode processing code, I have _never_ seen code that was actually looking
to logically operate on only pieces of characters. While it may operate on
code units for efficiency, it's always looking to logically operate on the
string as a unit or on whole characters.
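
To make the three levels concrete, here's a quick sketch using Phobos'
std.utf.byCodeUnit and std.uni.byGrapheme on a combining-character string:

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        // "é" encoded as 'e' followed by U+0301 (combining acute accent)
        string s = "e\u0301";

        writeln(s.byCodeUnit.walkLength); // 3 - UTF-8 code units
        writeln(s.walkLength);            // 2 - code points (what autodecoding iterates)
        writeln(s.byGrapheme.walkLength); // 1 - the full character the user sees
    }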

Anyone looking to operate on code points is going to need to take into
account the fact that they're not full characters, just like anyone who
operates on code units needs to take into account the fact that they're not
whole characters. Operating on code points as if they were characters -
which is exactly what D currently does with ranges - is just plain wrong.
We need to support operating at the code point level for those rare cases
where it's actually useful, but autodecoding makes no sense. It incurs a
performance penalty without actually giving correct results except in those
rare cases where you want code points instead of full characters. And only
Unicode experts are ever going to want that. The average programmer who is
not super Unicode savvy doesn't even know what code points are. They're
clearly going to be looking to operate on strings as sequences of
characters, not sequences of code points. I don't see how anyone could
expect otherwise. Code points are a mid-level, Unicode abstraction that only
those who are Unicode savvy are going to know or care about, let alone want
to operate on.
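
And a quick example of how treating code points as characters goes wrong
(std.algorithm's canFind on the same kind of combining-character string):

    import std.algorithm.searching : canFind;

    void main()
    {
        // "é" stored as 'e' + U+0301 (combining acute accent)
        string s = "e\u0301";

        // Autodecoding compares code points, so this happily "finds" an 'e'
        // even though the character the user sees is "é", not "e".
        assert(s.canFind('e'));
    }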

- Jonathan M Davis
