The Case Against Autodecode

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Fri Jun 3 01:05:27 PDT 2016


On Thu, Jun 02, 2016 at 05:19:48PM -0700, Walter Bright via Digitalmars-d wrote:
> On 6/2/2016 3:27 PM, John Colvin wrote:
> > > I wonder what rationale there is for Unicode to have two different
> > > sequences of codepoints be treated as the same. It's madness.
> > 
> > There are languages that make heavy use of diacritics, often several
> > on a single "character". Hebrew is a good example. Should there be
> > only one valid ordering of any given set of diacritics on any given
> > character?
> 
> I didn't say ordering, I said there should be no such thing as
> "normalization" in Unicode, where two codepoints are considered to be
> identical to some other codepoint.

I think it was a combination of historical baggage and trying to
accommodate unusual but still valid use cases.

The historical baggage was that Unicode was trying to unify all of the
various already-existing codepages out there, and many of those
codepages already came with various precomposed characters. To maximize
compatibility with existing codepages, Unicode tried to preserve as many
of the original mappings as possible within each 256-point block, so
these precomposed characters became part of the standard.
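
For instance -- a quick Python illustration, nothing D-specific, since
the mapping is defined by Unicode itself rather than by any language --
Latin-1's precomposed é survived the transition unchanged:

    # Latin-1's precomposed é (byte 0xE9) was carried over verbatim: it
    # became the single Unicode code point U+00E9, same numeric value.
    latin1_e_acute = bytes([0xE9]).decode("latin-1")
    assert latin1_e_acute == "\u00e9"
    assert ord(latin1_e_acute) == 0xE9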

However, there weren't enough of them -- some people demanded less
common character + diacritic combinations, and some languages had
writing so complex their characters had to be composed from more basic
parts. The original Unicode range was 16-bit, so there wasn't enough
room to fit all of the precomposed characters people demanded, plus
there were other things people wanted, like multiple diacritics (e.g.,
in IPA). So the concept of combining diacritics was invented, in part to
prevent combinatorial explosion from soaking up the available code point
space, in part to allow for novel combinations of diacritics that
somebody out there somewhere might want to represent.  However, this
meant that some precomposed characters were "redundant": they
represented character + diacritic combinations that could equally well
be expressed separately. Normalization was the inevitable consequence.
(Normalization, of course, also ties into a few other things, such as
collation, but this is one of the main factors behind it.)
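
To make the "redundancy" concrete, here's a small Python sketch (the
behaviour is defined by Unicode, not by Python or by Phobos): the same
accented letter can be spelled either as one precomposed code point or
as a base letter plus a combining mark, and normalization is what maps
one spelling onto the other.

    import unicodedata

    precomposed = "\u00e9"   # 'é' as one precomposed code point
    decomposed  = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

    # Two different code point sequences render as the same character...
    assert precomposed != decomposed
    assert (len(precomposed), len(decomposed)) == (1, 2)

    # ...and normalization maps one spelling onto the other.
    assert unicodedata.normalize("NFC", decomposed) == precomposed
    assert unicodedata.normalize("NFD", precomposed) == decomposed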

(This is a greatly over-simplified description, of course. At the time
Unicode also had to grapple with tricky issues like what to do with
lookalike characters that served different purposes or had different
meanings, e.g., the micro sign vs. the real letter mu in the Greek
block; or the Cyrillic A, which looks and behaves exactly like the
Latin A, whereas the Cyrillic Р, which looks like the Latin P, does
*not* mean the same thing (it's the equivalent of R); or the Cyrillic В,
whose lowercase is в, not b, and which also has a different sound, even
though lowercase Latin b looks very similar to Cyrillic ь, which serves
a completely different purpose (its uppercase is Ь, not B, you see).
Then you have the wonderful Indic and Arabic cursive scripts, where
letterforms mutate depending on the surrounding context and which, if
you were to encode every contextual variant as a distinct code point,
would occupy many more pages than they currently do. And there are also
sticky issues like the oft-mentioned Turkish i, which is encoded as the
Latin i but behaves differently w.r.t. upper/lowercasing in the Turkish
locale -- some cases of this, IIRC, are unfixable bugs in Phobos because
we currently do not handle locales. So you see, imagining that code
points == the solution to Unicode string handling is a joke. Writing
correct Unicode handling is *hard*.)
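
Two of those problems -- lookalike code points and locale-dependent
casing -- are easy to demonstrate; again in Python only because it's
convenient, since the behaviour comes from the Unicode character data,
not from any particular library:

    import unicodedata

    # Lookalike letters are distinct code points with distinct identities:
    assert "\u0410" != "A"  # CYRILLIC CAPITAL LETTER A vs. LATIN CAPITAL LETTER A
    assert unicodedata.name("\u0420") == "CYRILLIC CAPITAL LETTER ER"  # looks like Latin P

    # A locale-unaware case mapping silently gets Turkish wrong: in a
    # Turkish locale, "I" should lowercase to dotless ı (U+0131) and "i"
    # should uppercase to İ (U+0130), but the default mapping can't know that.
    assert "I".lower() == "i"   # right for English, wrong for Turkish
    assert "i".upper() == "I"   # right for English, wrong for Turkish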

As with all sufficiently complex software projects, Unicode represents a
compromise between many contradictory factors -- writing systems in the
world being the complex, not-very-consistent beasts they are -- so such
"dirty" details are somewhat inevitable.


T

-- 
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan

