The Case Against Autodecode

Fri Jun 3 15:35:18 PDT 2016

On 6/3/2016 1:53 PM, H. S. Teoh via Digitalmars-d wrote:
> But if we were to encode appearance instead of logical meaning, that
> would mean the *same* lowercase Cyrillic ь would have multiple,
> different encodings depending on which font was in use.

I don't see that consequence at all.

> That doesn't
> seem like the right solution either.  Do we really want Unicode strings
> to encode font information too??

No.

>  'Cos by that argument, serif and sans
> serif letters should have different encodings, because in languages like
> Hebrew, a tiny little serif could mean the difference between two
> completely different letters.

If they are different letters, then they should have different code points. I
don't see why this is such a hard concept.

> And what of the Arabic and Indic scripts? They would need to encode the
> same letter multiple times, each being a variation of the physical form
> that changes depending on the surrounding context. Even the Greek sigma
> has two forms depending on whether it's at the end of a word or not --
> so should it be two code points or one?

Two. Again, why is this hard to grasp? If there is meaning in having two 
different visual representations, then they are two codepoints. If the visual 
representation is the same, then it is one codepoint. If the difference is only
due to font selection, then it is the same codepoint.
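
As a point of reference, Unicode as it stands does assign the two lowercase
sigma forms separate code points. A minimal Python sketch of that, using only
the standard unicodedata module (Python and the variable names are chosen here
purely for illustration):

```python
import unicodedata

# The two lowercase sigma forms mentioned in the quoted text.
medial_sigma = "\u03c3"  # σ, used inside a word
final_sigma = "\u03c2"   # ς, used at the end of a word

# Print each code point with its official Unicode character name.
for ch in (medial_sigma, final_sigma):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# They are distinct code points, so the strings compare unequal.
assert medial_sigma != final_sigma

# Both map to the same uppercase letter, capital sigma (U+03A3).
assert medial_sigma.upper() == final_sigma.upper() == "\u03a3"
```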

> Besides, that still doesn't solve the problem of what "i".uppercase()
> should return. In most languages, it should return "I", but in Turkish
> it should not.
> And if we really went the route of encoding Cyrillic
> letters the same as their Latin lookalikes, we'd have a problem with
> what "m".uppercase() should return, because now it depends on which font
> is in effect (if it's a Cyrillic cursive font, the correct answer is
> "Т", if it's a Latin font, the correct answer is "M" -- the other
> combinations: who knows).  That sounds far worse than what we have
> today.

The notion of 'case' should not be part of Unicode, as that is semantic
information beyond its scope.
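
To make the quoted "i".uppercase() problem concrete, here is a small Python
sketch. It assumes only that Python's str.upper() and str.lower() apply
Unicode's default, locale-independent case mappings; the variable names are
illustrative. It shows the behaviour under discussion rather than settling
whether case belongs in Unicode:

```python
# Characters involved in the Turkish "i" problem.
dotted_lower = "i"         # U+0069 LATIN SMALL LETTER I
dotless_lower = "\u0131"   # ı  LATIN SMALL LETTER DOTLESS I
dotted_upper = "\u0130"    # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE

# The default mapping sends "i" to "I". Turkish text would expect İ here,
# but the string carries no language information to say so.
assert dotted_lower.upper() == "I"

# Dotless ı also uppercases to plain "I" under the default mapping.
assert dotless_lower.upper() == "I"

# İ lowercases to "i" followed by U+0307 COMBINING DOT ABOVE,
# so the result is two code points long.
assert dotted_upper.lower() == "i\u0307"

# Cyrillic т (U+0442) and Latin m are separate code points today,
# so each has an unambiguous uppercase regardless of font.
cyrillic_te = "\u0442"
assert cyrillic_te.upper() == "\u0422"   # Т
assert "m".upper() == "M"
```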