The Case Against Autodecode

Fri Jun 3 23:17:17 PDT 2016

On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 6:08 PM, H. S. Teoh via Digitalmars-d wrote:
> > It's not a hard concept, except that these different letters have
> > lookalike forms with completely unrelated letters. Again:
> > 
> > - Lowercase Latin m looks visually the same as lowercase Cyrillic Т
> > in cursive form. In some font renderings the two are IDENTICAL
> > glyphs, in spite of being completely different, unrelated letters.
> > However, in non-cursive form, Cyrillic lowercase т is visually
> > distinct.
> > 
> > - Similarly, lowercase Cyrillic П in cursive font looks like
> > lowercase Latin n, and in some fonts they are identical glyphs.
> > Again, completely unrelated letters, yet they have the SAME VISUAL
> > REPRESENTATION.  However, in non-cursive font, lowercase Cyrillic П
> > is п, which is visually distinct from Latin n.
> > 
> > - These aren't the only ones, either.  Other Cyrillic false friends
> > include cursive Д, which in some fonts looks like lowercase Latin g.
> > But in non-cursive font, it's д.
> > 
> > Just given the above, it should be clear that going by visual
> > representation is NOT enough to disambiguate between these different
> > letters.
> 
> It works for books.

Because books don't allow their readers to change the font.

> Unicode invented a problem, and came up with a thoroughly wretched
> "solution" that we'll be stuck with for generations. One of those bad
> solutions is have the reader not know what a glyph actually is without
> pulling back the cover to read the codepoint. It's madness.

This madness already exists *without* Unicode. If you have a page with a
single glyph 'm' printed on it and show it to an English speaker, he
will say it's lowercase M. Show it to a Russian speaker, and he will say
it's lowercase Т.  So which letter is it, M or Т?

The fundamental problem is that writing systems for different languages
interpret the same letter forms differently.  In English, lowercase g
has at least two different forms that we recognize as the same letter.
However, to a Cyrillic reader the two forms are distinct, because one of
them looks like a Cyrillic letter but the other one looks foreign. So
should g be encoded as a single point or two different points?

In a similar vein, to a Cyrillic reader the glyphs т and m represent the
same letter, but to an English reader they are clearly two different
things.

If you're going to represent both languages, you cannot get away from
needing to represent letters abstractly, rather than visually.
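
To make the abstract-vs-visual distinction concrete, here is a minimal
sketch in Python (my choice of language for illustration; the helper name
describe is made up). It only uses the standard unicodedata module and
shows that Latin m and Cyrillic т are separate abstract letters no matter
how a given font happens to draw them:

```python
import unicodedata

# Latin small m vs. Cyrillic small te: identical glyphs in some cursive
# fonts, but distinct abstract letters with distinct code points.
latin_m = "m"            # U+006D
cyrillic_te = "\u0442"   # U+0442, the letter т

def describe(ch):
    """Return (code point, Unicode character name) for one character.

    Hypothetical helper for illustration only.
    """
    return (f"U+{ord(ch):04X}", unicodedata.name(ch))

print(describe(latin_m))        # ('U+006D', 'LATIN SMALL LETTER M')
print(describe(cyrillic_te))    # ('U+0442', 'CYRILLIC SMALL LETTER TE')
print(latin_m == cyrillic_te)   # False
```

Nothing in the comparison depends on the font: the two strings differ even
when a cursive font renders them with the same shape.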

> > By your argument, since lowercase Cyrillic Т is, visually, just m,
> > it should be encoded the same way as lowercase Latin m. But this is
> > untenable, because the letterform changes with a different font. So
> > you end up with the unworkable idea of a font-dependent encoding.
> 
> Oh rubbish. Let go of the idea that choosing bad fonts should drive
> Unicode codepoint decisions.

It's not a bad font. It's standard practice to print Cyrillic cursive
letters with different glyphs. Russian readers can read both without any
problem.  The same letter is represented by different glyphs, and
therefore the abstract letter is a more fundamental unit of meaning than
the glyph itself.

> > Or, to use an example closer to home, uppercase Latin O and the
> > digit 0 are visually identical. Should they be encoded as a single
> > code point or two?  Worse, in some fonts, the digit 0 is rendered
> > like Ø (to differentiate it from uppercase O). Does that mean that
> > it should be encoded the same way as the Danish letter Ø?  Obviously
> > not, but according to your "visual representation" idea, the answer
> > should be yes.
> 
> Don't confuse fonts with code points. It'd be adequate if Unicode
> defined a canonical glyph for each code point, and let the font makers
> do what they wish.

So should O and 0 share the same glyph or not? They're visually the same
thing, even though some fonts render them differently. What should be
the canonical shape of O vs. 0? If they are the same shape, then by your
argument they must be the same code point, regardless of what font
makers do to disambiguate them.  Good luck writing a parser that can't
tell the difference between an identifier that begins with O and a
number literal that begins with 0.

The very fact that we distinguish between O and 0, independently of what
Unicode did/does, is already proof enough that going by visual
representation is inadequate.
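
As a rough illustration of the parser point, here is a toy sketch in
Python. The function classify and its rule are hypothetical, not taken
from any real lexer; the only thing it relies on is that O and 0 are
different code points:

```python
def classify(token):
    """Classify a token by its first character (toy lexer rule, assumed)."""
    first = token[0]
    if first in "0123456789":
        return "number"
    if first.isalpha() or first == "_":
        return "identifier"
    return "other"

print(classify("O2"))   # identifier: starts with the letter O (U+004F)
print(classify("02"))   # number: starts with the digit 0 (U+0030)
```

If O and 0 were one code point because they look alike, the first branch
could not be written at all.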

> > > The notion of 'case' should not be part of Unicode, as that is
> > > semantic information that is beyond the scope of Unicode.
> > But what should "i".toUpper return?
> 
> Not relevant to my point that Unicode shouldn't decide what "upper
> case" for all languages means, any more than Unicode should specify a
> font. Now when you argue that Unicode should make such decisions, note
> what a spectacularly hopeless job of it they've done.

In other words, toUpper and toLower do not belong in the standard
library. Great.
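
For reference, the "i" question has no single answer without knowing the
language. The sketch below is in Python and the function to_upper with its
lang parameter is hypothetical; it handles only the Turkish/Azeri dotted-i
rule and is nowhere near a complete case-mapping implementation:

```python
def to_upper(text, lang="en"):
    """Hypothetical language-aware uppercase (illustration only)."""
    if lang in ("tr", "az"):
        # In Turkish and Azeri, dotted i uppercases to dotted İ (U+0130),
        # while dotless ı (U+0131) uppercases to plain I.
        text = text.replace("i", "\u0130")
    return text.upper()

print(to_upper("i"))               # I
print(to_upper("i", lang="tr"))    # İ
print(to_upper("\u0131", "tr"))    # I
```

The same input string gives different results depending on the language,
which is information the string itself does not carry.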

T

-- 
Heuristics are bug-ridden by definition. If they didn't have bugs, they'd be algorithms.