The Case Against Autodecode

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Fri May 27 12:40:14 PDT 2016


On Fri, May 27, 2016 at 02:42:27PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> On 5/27/16 12:40 PM, H. S. Teoh via Digitalmars-d wrote:
> > Exactly. And we just keep getting stuck on this point. It seems that
> > the message just isn't getting through. The unfounded assumption
> > continues to be made that iterating by code point is somehow
> > "correct" by definition and nobody can challenge it.
> 
> Which languages are covered by code points, and which languages
> require graphemes consisting of multiple code points? How does
> normalization play into this? -- Andrei

This is a complicated issue; for a full explanation you'll probably want
to peruse the Unicode codices. For example:

	http://www.unicode.org/faq/char_combmark.html

But in brief: it's mostly a number of common European languages that have
a 1-to-1 code point to character mapping, as well as Chinese writing.
Outside of this narrow set, you're on shaky ground.  Examples (just the
ones I can think of; there are many others):

- Almost all Korean characters are composed of multiple code points.

- The Indic languages (which span quite a good number of Unicode
  blocks) have ligatures that require multiple code points.

- The Thai block contains a series of combining diacritics for vowels
  and tones.

- Hebrew vowel points require multiple code points.

- A good number of native American scripts require combining marks,
  e.g., Navajo.

- The International Phonetic Alphabet (primarily for linguistic use,
  but potentially widespread because it's relevant wherever language is
  studied).

- Classical Greek accents (though this is less common, mostly being used
  only in academic circles).
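To make the Korean case concrete, here's a quick sketch (in Python, simply because its standard unicodedata module makes the point easy to show; the same thing happens in any language that iterates by code point): the syllable 한 can be spelled either as one precomposed code point or as three conjoining jamo, so code-point iteration sees a different number of "characters" depending on the spelling.

```python
import unicodedata

# The Korean syllable "han" (한), written two canonically equivalent ways:
decomposed = "\u1112\u1161\u11ab"  # conjoining jamo: HIEUH + A + NIEUN
precomposed = "\ud55c"             # single precomposed code point U+D55C

# Iterating by code point gives different lengths for the same character:
assert len(decomposed) == 3
assert len(precomposed) == 1

# NFC normalization composes the jamo sequence into the precomposed form:
assert unicodedata.normalize("NFC", decomposed) == precomposed
```

Here normalization happens to rescue you, because Hangul syllables have precomposed forms; as discussed below, that is not true in general.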

Even within the realm of European languages and other languages that use
some version of the Latin script, there is an entire block of code
points in Unicode (U+0300, Combining Diacritical Marks) dedicated to
combining diacritics. A good number of combinations do not have
precomposed characters.
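For instance (again a Python sketch, but nothing here is Python-specific), "é" can be written as one code point or as two, and the two spellings compare unequal as code-point sequences even though they are canonically equivalent:

```python
# "é" two ways: precomposed vs. base letter plus U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # 'e' followed by a combining diacritic

assert len(precomposed) == 1
assert len(decomposed) == 2

# Canonically equivalent text, yet unequal as code-point sequences:
assert precomposed != decomposed
```

So any code that compares or counts "characters" by code point silently treats these as different strings of different lengths.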

Now as far as normalization is concerned, it only helps if a particular
combination of diacritics on a base glyph has a precomposed form. Many
of the above languages lack precomposed characters simply because of the
sheer number of possible combinations. The only reason the CJK block
includes a huge number of precomposed characters is that the rules for
combining the base forms are too complex to encode compositionally.
Otherwise, most languages with combining diacritics would not have
precomposed characters assigned to their respective blocks.  In fact, a
good number (all?) of the precomposed Latin characters were included in
Unicode only because they existed in pre-Unicode encodings, and
compatibility with those encodings was desired back when Unicode was
not yet widely adopted.
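A small illustration of that limit (Python again, using only the stdlib unicodedata module): NFC composes e + combining acute into U+00E9 because that precomposed character exists, but x + combining acute has no precomposed form, so normalization necessarily leaves it as two code points.

```python
import unicodedata

# NFC helps when a precomposed character exists:
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"  # one code point

# ...but "x" + combining acute has no precomposed form, so NFC
# has nothing to compose it into and it stays two code points:
assert len(unicodedata.normalize("NFC", "x\u0301")) == 2
```

In other words, normalization can shrink some sequences, but it cannot make "1 code point == 1 character" hold in general.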

So basically, outside a small number of languages, the idea of 1 code
point == 1 character is pretty unworkable, especially in this day and
age of worldwide connectivity.


T

-- 
The diminished 7th chord is the most flexible and fear-instilling chord. Use it often, use it unsparingly, to subdue your listeners into submission!
