The Case Against Autodecode

Fri Jun 3 20:03:16 PDT 2016

On 6/3/2016 6:08 PM, H. S. Teoh via Digitalmars-d wrote:
> It's not a hard concept, except that these different letters have
> lookalike forms with completely unrelated letters. Again:
>
> - Lowercase Latin m looks visually the same as lowercase Cyrillic Т in
>   cursive form. In some font renderings the two are IDENTICAL glyphs, in
>   spite of being completely different, unrelated letters.  However, in
>   non-cursive form, Cyrillic lowercase т is visually distinct.
>
> - Similarly, lowercase Cyrillic П in cursive font looks like lowercase
>   Latin n, and in some fonts they are identical glyphs. Again,
>   completely unrelated letters, yet they have the SAME VISUAL
>   REPRESENTATION.  However, in non-cursive font, lowercase Cyrillic П is
>   п, which is visually distinct from Latin n.
>
> - These aren't the only ones, either.  Other Cyrillic false friends
>   include cursive Д, which in some fonts looks like lowercase Latin g.
>   But in non-cursive font, it's д.
>
> Just given the above, it should be clear that going by visual
> representation is NOT enough to disambiguate between these different
> letters.

It works for books. Unicode invented a problem, and came up with a thoroughly 
wretched "solution" that we'll be stuck with for generations. One of those bad 
solutions is to have the reader not know what a glyph actually is without pulling 
back the cover to read the codepoint. It's madness.
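
For anyone who wants to see what "pulling back the cover" looks like in practice, here is a minimal sketch in Python (not D, and not from the original message). It uses the standard unicodedata module; the helper name describe is just an illustrative choice. It prints the code point and Unicode name of the lookalike characters mentioned in this thread.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (illustrative helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Pairs from the discussion: visually similar in some fonts,
# but encoded as distinct code points.
lookalikes = [
    ("m", "\u0442"),  # Latin m vs. Cyrillic te (cursive form resembles m)
    ("n", "\u043f"),  # Latin n vs. Cyrillic pe (cursive form resembles n)
    ("O", "0"),       # Latin capital O vs. digit zero
    ("0", "\u00d8"),  # digit zero vs. O with stroke (slashed-zero fonts)
]

for a, b in lookalikes:
    print(f"{a!r}: {describe(a)}")
    print(f"{b!r}: {describe(b)}")
    print()
```

Whatever a given font draws, the two members of each pair compare unequal as strings, which is the behavior both sides of this argument are reacting to.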

> By your argument, since lowercase Cyrillic Т is, visually,
> just m, it should be encoded the same way as lowercase Latin m. But this
> is untenable, because the letterform changes with a different font. So
> you end up with the unworkable idea of a font-dependent encoding.

Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode 
codepoint decisions.

> Or, to use an example closer to home, uppercase Latin O and the digit 0
> are visually identical. Should they be encoded as a single code point or
> two?  Worse, in some fonts, the digit 0 is rendered like Ø (to
> differentiate it from uppercase O). Does that mean that it should be
> encoded the same way as the Danish letter Ø?  Obviously not, but
> according to your "visual representation" idea, the answer should be
> yes.

Don't confuse fonts with code points. It'd be adequate if Unicode defined a 
canonical glyph for each code point, and let the font makers do what they wish.

>> The notion of 'case' should not be part of Unicode, as that is
>> semantic information that is beyond the scope of Unicode.
> But what should "i".toUpper return?

Not relevant to my point that Unicode shouldn't decide what "upper case" for all 
languages means, any more than Unicode should specify a font. Now when you argue 
that Unicode should make such decisions, note what a spectacularly hopeless job 
of it they've done.
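
To make the "i".toUpper question concrete, here is a small sketch, again in Python rather than D, so str.upper stands in for D's toUpper. The function upper_turkish is a hypothetical helper written for this illustration, not a library API, and it covers only the dotted/dotless i rule used in Turkish.

```python
def upper_turkish(s: str) -> str:
    """Uppercase with the Turkish dotted-i rule (hypothetical helper).

    In Turkish, 'i' uppercases to U+0130 (capital I with dot above),
    while dotless U+0131 uppercases to plain 'I'.
    """
    # Map dotted i first; str.upper() already maps U+0131 to 'I'
    # and leaves U+0130 unchanged.
    return s.replace("i", "\u0130").upper()


# Default, locale-independent mapping: 'i' becomes plain 'I'.
print("istanbul".upper())

# Turkish tailoring: 'i' becomes U+0130.
print(upper_turkish("istanbul"))
```

The same input string has two different "correct" uppercase forms depending on the language it is written in, which is exactly the kind of semantic decision being argued about here.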