The Case Against Autodecode

Jonathan M Davis via Digitalmars-d digitalmars-d at puremagic.com
Fri Jun 3 14:10:51 PDT 2016


On Friday, June 03, 2016 03:08:43 Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
> > At the time
> > Unicode also had to grapple with tricky issues like what to do with
> > lookalike characters that served different purposes or had different
> > meanings, e.g., the mu sign in the math block vs. the real letter mu in
> > the Greek block, or the Cyrillic A which looks and behaves exactly like
> > the Latin A, yet the Cyrillic Р, which looks like the Latin P, does
> > *not* mean the same thing (it's the equivalent of R), or the Cyrillic В
> > whose lowercase is в not b, and also has a different sound, but
> > lowercase Latin b looks very similar to Cyrillic ь, which serves a
> > completely different purpose (the uppercase is Ь, not B, you see).
>
> I don't see that this is tricky at all. Adding additional semantic meaning
> that does not exist in printed form was outside of the charter of Unicode.
> Hence there is no justification for having two distinct characters with
> identical glyphs.
>
> They should have put me in charge of Unicode. I'd have put a stop to much of
> the madness :-)

Actually, I would argue that the moment Unicode concerns itself with what
a character actually looks like rather than with what character it logically
is, it has gone outside its charter. How a character actually looks is far
too dependent on fonts, and aside from display code, code does not care one
whit what the character looks like.

For instance, take the capital letter I, the lowercase letter l, and the
number one. In some fonts that are feeling cruel towards folks who actually
want to read them, two of those characters - or even all three of them -
look identical. But I think that you'll agree that those characters should
be represented as distinct characters in Unicode regardless of what they
happen to look like in a particular font.
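
To put the point in D terms (just a sketch - the code point values here
are simply the standard ones from the Unicode charts):

import std.stdio;

void main()
{
    // Three characters that can render identically in an unkind font,
    // yet each has its own code point and thus its own identity.
    dchar capitalI = 'I'; // U+0049 LATIN CAPITAL LETTER I
    dchar smallL   = 'l'; // U+006C LATIN SMALL LETTER L
    dchar one      = '1'; // U+0031 DIGIT ONE

    writefln("U+%04X, U+%04X, U+%04X",
             cast(uint) capitalI, cast(uint) smallL, cast(uint) one);
    writeln(capitalI == smallL); // false, no matter what the font does
}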

Now, take a Cyrillic letter that looks similar to a Latin letter. If they're
logically equivalent - such that no code would ever want to distinguish
between the two, and no font would ever even consider representing them
differently - then they're truly the same letter, and they should have only
one Unicode representation. But if anyone would ever consider them logically
distinct, then it makes no sense for Unicode to treat them as the same
character, because they don't have the same identity. And that distinction
is quite clear if any font would ever consider representing the two
characters differently, no matter how slight that difference might be.
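
To make that concrete (again, just a sketch; U+0041 and U+0410 are the
standard code points for the two letters):

import std.stdio;

void main()
{
    dchar latinA    = '\u0041'; // LATIN CAPITAL LETTER A
    dchar cyrillicA = '\u0410'; // CYRILLIC CAPITAL LETTER A

    // Indistinguishable on screen in most fonts, but to any code that
    // compares them, they're simply different characters.
    writeln(latinA == cyrillicA); // false
    writefln("U+%04X vs U+%04X", cast(uint) latinA, cast(uint) cyrillicA);
}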

Really, what a character looks like has nothing to do with Unicode. The
exact same Unicode code points are used regardless of how the text is
displayed. Rather, what Unicode is doing is providing logical identifiers
for characters so that code can operate on them, and display code can then
do whatever it does to display those characters, whether they happen to look
similar or not. I would think that the fact that non-display code does not
care one whit what a character looks like - and that display code can have
drastically different visual representations of the same character - would
make it clear that Unicode is concerned with providing identifiers for
logical characters, and that that is distinct from any visual representation.
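
Case mapping is a handy example of non-display code working purely on
logical identity. A minimal sketch using Phobos' std.uni, whose toLower
operates on the character, not the glyph:

import std.stdio;
import std.uni : toLower;

void main()
{
    dchar latinB    = '\u0042'; // LATIN CAPITAL LETTER B
    dchar cyrillicV = '\u0412'; // CYRILLIC CAPITAL LETTER VE

    // Identical-looking uppercase glyphs, different lowercase results,
    // because case mapping follows the logical character.
    writeln(toLower(latinB));    // b (U+0062)
    writeln(toLower(cyrillicV)); // в (U+0432)
}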

- Jonathan M Davis
