The Case Against Autodecode

Fri Jun 3 18:08:09 PDT 2016

On Fri, Jun 03, 2016 at 03:35:18PM -0700, Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 1:53 PM, H. S. Teoh via Digitalmars-d wrote:
[...]
> > 'Cos by that argument, serif and sans serif letters should have
> > different encodings, because in languages like Hebrew, a tiny little
> > serif could mean the difference between two completely different
> > letters.
> 
> If they are different letters, then they should have a different code
> point.  I don't see why this is such a hard concept.
[...]

It's not a hard concept, except that these different letters have
lookalike forms with completely unrelated letters. Again:

- Lowercase Latin m looks visually the same as lowercase Cyrillic Т in
  cursive form. In some font renderings the two are IDENTICAL glyphs, in
  spite of being completely different, unrelated letters.  However, in
  non-cursive form, Cyrillic lowercase т is visually distinct.

- Similarly, lowercase Cyrillic П in cursive font looks like lowercase
  Latin n, and in some fonts they are identical glyphs. Again,
  completely unrelated letters, yet they have the SAME VISUAL
  REPRESENTATION.  However, in non-cursive font, lowercase Cyrillic П is
  п, which is visually distinct from Latin n.

- These aren't the only ones, either.  Other Cyrillic false friends
  include cursive lowercase Д, which in some fonts looks like lowercase
  Latin g.  But in non-cursive font, it's д.

Just given the above, it should be clear that going by visual
representation is NOT enough to disambiguate between these different
letters.  By your argument, since lowercase Cyrillic Т is, visually,
just m, it should be encoded the same way as lowercase Latin m. But this
is untenable, because the letterform changes with a different font. So
you end up with the unworkable idea of a font-dependent encoding.

Similarly, since lowercase Cyrillic П is n (in cursive font), we should
encode it the same way as Latin lowercase n. But again, the letterform
changes based on font.  Your criterion of "same visual representation"
does not work outside of English.  What you imagine to be a simple,
straightforward concept is far from being simple once you're dealing
with the diverse languages and writing systems of the world.
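
To make the lookalike pairs concrete, here is a quick sketch in Python
(just an illustration; the describe() helper is a name I made up for
this example). It prints the code point and Unicode name of each Latin
letter next to its Cyrillic false friend, showing that they are encoded
separately no matter how a given font happens to draw them:

```python
import unicodedata

def describe(ch):
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Latin letters paired with the Cyrillic letters whose cursive forms
# can look identical to them in some fonts.
for latin, cyrillic in [("m", "т"), ("n", "п"), ("g", "д")]:
    print(describe(latin), "vs", describe(cyrillic))
```

Nothing in these code points says anything about glyph shape; that part
is left entirely to the font.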

Or, to use an example closer to home, uppercase Latin O and the digit 0
are visually identical in many fonts. Should they be encoded as a single
code point or two?  Worse, in some fonts, the digit 0 is rendered like Ø (to
differentiate it from uppercase O). Does that mean that it should be
encoded the same way as the Danish letter Ø?  Obviously not, but
according to your "visual representation" idea, the answer should be
yes.

The bottom line is that uppercase O and the digit 0 represent different
LOGICAL entities, in spite of their sharing the same visual
representation.  Eventually you have to resort to representing *logical*
entities ("characters") rather than visual appearance, which is a
property of the font, and has no place in a digital text encoding.
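
The same point can be checked mechanically. A small Python sketch
(illustrative only, using the standard unicodedata module) lists the
code point, general category, and name of O, 0 and Ø. The letter and
the digit fall into different categories, which is exactly the
"logical entity" distinction:

```python
import unicodedata

# Uppercase Latin O, the digit zero, and the Danish letter Ø.
for ch in ["O", "0", "Ø"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# The digit is a number, the other two are uppercase letters,
# regardless of how similar a font makes them look.
print("0".isdigit(), "O".isdigit(), "Ø".isdigit())
```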

> > Besides, that still doesn't solve the problem of what
> > "i".uppercase() should return. In most languages, it should return
> > "I", but in Turkish it should not.
> > And if we really went the route of encoding Cyrillic letters the
> > same as their Latin lookalikes, we'd have a problem with what
> > "m".uppercase() should return, because now it depends on which font
> > is in effect (if it's a Cyrillic cursive font, the correct answer is
> > "Т", if it's a Latin font, the correct answer is "M" -- the other
> > combinations: who knows).  That sounds far worse than what we have
> > today.
> 
> The notion of 'case' should not be part of Unicode, as that is
> semantic information that is beyond the scope of Unicode.

But what should "i".toUpper return?  Or are you saying the standard
library should not include such a basic function as a case-changing
function?
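
For what it's worth, here is how the question looks in a language whose
built-in case conversion is locale-independent. This is a Python sketch,
not D, and turkish_upper() is a hypothetical helper I'm making up purely
for illustration (it is not a complete Turkish casing implementation):

```python
def turkish_upper(text):
    """Uppercase text using the Turkish dotted/dotless i convention.

    Hypothetical helper: maps 'i' to 'İ' (U+0130) before applying the
    default, locale-independent uppercasing to everything else.
    """
    return text.replace("i", "\u0130").upper()

# Default mapping: what most languages expect.
print("istanbul".upper())

# Turkish mapping: dotted i stays dotted, dotless ı (U+0131) becomes I.
print(turkish_upper("istanbul"))
print(turkish_upper("\u0131s\u0131"))
```

Either way, somebody has to decide which mapping applies, and that
decision depends on the language of the text, not on how the letters
look.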

T

-- 
Customer support: the art of getting your clients to pay for your own
incompetence.