The Case Against Autodecode

Jonathan M Davis via Digitalmars-d digitalmars-d at puremagic.com
Fri May 13 14:46:28 PDT 2016


On Friday, May 13, 2016 12:52:13 Kagamin via Digitalmars-d wrote:
> On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:
> > IIRC, Andrei talked in TDPL about how Java's choice to go with
> > UTF-16 was worse than the choice to go with UTF-8, because it
> > was correct in many more cases
>
> UTF-16 was a migration from UCS-2, and UCS-2 was superior at the
> time.

The history of why UTF-16 was chosen isn't really relevant to my point
(Win32 has the same problem as Java and for similar reasons).

My point was that if you use UTF-8, then it's obvious _really_ fast when
you've screwed up Unicode handling by treating a code unit as a character,
because anything beyond ASCII is going to fall flat on its face. But with
UTF-16, a _lot_ more code points fit in a single code unit - and many of
those are also a single grapheme - so it's far easier to write code that
treats a code unit as if it were a full character without realizing that
you're screwing it up. UTF-8 is fail-fast in this regard, whereas UTF-16 is
not.
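
To make the fail-fast point concrete, here is a minimal sketch in Python
(not from the original discussion; the helper name code_units and the sample
characters are assumptions picked for illustration). It counts how many code
units a single code point occupies in UTF-8 and in UTF-16:

```python
# Hypothetical helper: number of code units `text` occupies in an encoding.
# The "-le" variants are used so that no byte-order mark is prepended.
UNIT_SIZE = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}

def code_units(text, encoding):
    return len(text.encode(encoding)) // UNIT_SIZE[encoding]

# Sample code points (chosen for illustration only).
samples = {
    "U+0061 LATIN SMALL LETTER A": "\u0061",
    "U+00E9 LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "U+4E2D CJK IDEOGRAPH": "\u4e2d",
    "U+1F600 GRINNING FACE": "\U0001F600",
}

for name, ch in samples.items():
    print(name, code_units(ch, "utf-8"), code_units(ch, "utf-16-le"))
```

Under these assumptions, only the ASCII letter is a single UTF-8 code unit,
so code that treats a code unit as a character breaks on the first accented
letter. In UTF-16, everything except the emoji is a single code unit, so the
same mistake survives until something outside the BMP shows up.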

UTF-32 takes that problem to a new level, because now you'll only notice
problems when you're dealing with a grapheme constructed of multiple code
points. So, odds are that even if you test with Unicode strings, you won't
catch the bugs. It'll work 99% of the time, and you'll get subtle bugs the
rest of the time.
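
A rough Python sketch of that failure mode (again not from the original
discussion; rough_graphemes is a hypothetical helper, and it only
approximates grapheme clusters by attaching combining marks to the preceding
code point, which is much simpler than the real segmentation rules in
UAX #29):

```python
import unicodedata

def rough_graphemes(text):
    """Group each code point with the combining marks that follow it.

    Approximation only: real grapheme segmentation (UAX #29) has many
    more rules than this.
    """
    clusters = []
    for cp in text:
        if clusters and unicodedata.combining(cp) != 0:
            clusters[-1] += cp   # attach the mark to the previous cluster
        else:
            clusters.append(cp)  # start a new cluster
    return clusters

# "noel" with a diaeresis written as e + U+0308 COMBINING DIAERESIS.
word = "noe\u0308l"

code_points = len(word)                            # 5 (also 5 UTF-32 code units)
graphemes = len(rough_graphemes(word))             # 4 user-perceived characters

# Reversing by code point moves the diaeresis onto the "l".
by_code_point = word[::-1]                         # "l\u0308eon"
# Reversing by (approximate) grapheme keeps it on the "e".
by_grapheme = "".join(reversed(rough_graphemes(word)))  # "le\u0308on"
```

A test string made only of precomposed characters would pass through the
code-point version unchanged, which is why this class of bug tends to slip
past testing.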

There are reasons to operate at the code point level, but in general, you
either want to be operating at the code unit level or the grapheme level,
not the code point level. And if you don't know what you're doing, then
anything other than the grapheme level is likely going to be wrong if you're
manipulating individual characters. Fortunately, a lot of string processing
doesn't need to operate on individual characters, and as long as the
standard library functions get it right, you'll tend to be okay. Still,
operating at the code point level is almost always wrong, and the mistake is
even harder to catch than treating UTF-16 code units as characters.

- Jonathan M Davis


