The Case Against Autodecode

Jonathan M Davis via Digitalmars-d digitalmars-d at puremagic.com
Tue May 31 10:33:51 PDT 2016


On Friday, May 27, 2016 16:41:09 Andrei Alexandrescu via Digitalmars-d wrote:
> On 05/27/2016 03:43 PM, H. S. Teoh via Digitalmars-d wrote:
> > That's what we've been trying to say all along!
>
> If that's the case things are pretty dire, autodecoding or not. -- Andrei

True enough. Correctly handling Unicode in the general case is ridiculously
hard - especially if you want to be efficient. We could do everything at
the grapheme level to get the correctness, but we'd be unacceptably slow.
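
To illustrate how the levels disagree, here's a minimal sketch using
Phobos' std.uni.byGrapheme (the counts assume a UTF-8 string in NFD form,
where é is encoded as e plus a combining accent):

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;

    void main()
    {
        // "café" in NFD: 'e' followed by U+0301 COMBINING ACUTE ACCENT
        string s = "cafe\u0301";

        writeln(s.length);                // 6 - UTF-8 code units
        writeln(s.walkLength);            // 5 - code points (autodecoded)
        writeln(s.byGrapheme.walkLength); // 4 - graphemes the user sees
    }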

Fortunately, many string algorithms really don't need to care much about
Unicode so long as the strings involved are normalized. For instance, a
function like find can usually compare code units without decoding anything
(though even then, depending on the normalization, you run the risk of
finding part of a character if combining code points are involved - e.g.
searching for e could give you the first half of é if it's encoded as the e
followed by the combining accent).
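
A minimal sketch of that false positive (assuming std.utf.byCodeUnit to
suppress autodecoding, with the NFD encoding spelled out explicitly):

    import std.algorithm.searching : find;
    import std.stdio : writeln;
    import std.utf : byCodeUnit;

    void main()
    {
        // "café" in NFD: the 'e' is followed by U+0301 COMBINING ACUTE ACCENT.
        string haystack = "cafe\u0301";

        // Comparing code units matches the 'e' that is really the first
        // half of the user-perceived character "é".
        auto hit = haystack.byCodeUnit.find('e');
        writeln(hit.empty ? "no match" : "matched"); // prints "matched"
    }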

But ultimately, fully correct string handling requires a far better
understanding of Unicode than most programmers have. Even here, the
percentage of programmers with that level of understanding isn't all that
high - though the fact that D supports UTF-8, UTF-16, and UTF-32 the way
that it does has led a number of us to dig further into Unicode and learn
it better than we probably would have if all D had was char. It highlights
that getting this right requires learning something that most languages
never force you to confront.

- Jonathan M Davis



