The Case Against Autodecode

Jonathan M Davis via Digitalmars-d digitalmars-d at puremagic.com
Sun Jun 5 08:44:59 PDT 2016


On Friday, June 03, 2016 15:38:38 Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 2:10 PM, Jonathan M Davis via Digitalmars-d wrote:
> > Actually, I would argue that the moment Unicode is concerned with what
> > the character actually looks like rather than what character it
> > logically is, it's gone outside of its charter. The way that characters
> > actually look is far too dependent on fonts, and aside from display
> > code, code does not care one whit what the character looks like.
>
> What I meant was pretty clear. Font is an artistic style that does not
> change context nor semantic meaning. If a font choice changes the meaning
> then it is not a font.

Well, maybe I misunderstood what was being argued, but it seemed like you've
been arguing that two characters should be considered the same just because
they look similar, whereas H. S. Teoh is arguing that two characters can be
logically distinct while still looking similar and that they should be
treated as distinct in Unicode because they're logically distinct. And if
that's what's being argued, then I agree with H. S. Teoh.

I expect Unicode - at least ideally - to contain identifiers for characters
that are distinct from whatever their visual representation might be. Fonts
and the like then worry about how to display them, and hopefully don't do
stupid things like making a capital I look like a lowercase l (though they
often do, unfortunately). But if two characters in different scripts - be
they Latin and Cyrillic or whatever - happen to often look the same but
would be considered two different characters by humans, then I would expect
Unicode to consider them to be different, whereas if no one would reasonably
consider them to be anything but exactly the same character, then there
should only be one character in Unicode.
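And for what it's worth, that is how Unicode actually handles the Latin vs Cyrillic case: the look-alike letters are distinct code points. A quick sketch (in Python rather than D, purely for brevity):

```python
import unicodedata

# Latin "a" (U+0061) and Cyrillic "а" (U+0430) render nearly identically
# in most fonts, but Unicode encodes them as logically distinct characters.
latin_a = "a"          # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # U+0430 CYRILLIC SMALL LETTER A

print(latin_a == cyrillic_a)         # False: different code points
print(unicodedata.name(latin_a))     # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic_a))  # CYRILLIC SMALL LETTER A
```

So any code comparing strings by code point already treats them as different characters, regardless of how similar a given font renders them.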

However, if we really do have crazy cases where subtly different visual
representations of the letter g are considered to be one character in
English and two in Russian, then maybe those should be three different
characters in Unicode, so that the English text can clearly be operating on
g, whereas the Russian text does whatever it does with its two characters
that happen to look like g. I don't know. That sort of thing just gets
ugly. But I definitely think that Unicode should encode the logical
characters and leave the visual representation up to the fonts and the
like.

Now, how to deal with uppercase vs lowercase and all of that sort of stuff
is a completely separate issue IMHO. That comes down to how the characters
are logically associated with one another, and it's going to be very
locale-specific, so it's not really part of the core of Unicode's charter
(though I'm not sure that it's bad if there's a set of locale rules that
goes along with Unicode for those looking to correctly apply such rules -
they just have nothing to do with code points and graphemes and how they're
represented in code).
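The locale problem is concrete even with Unicode's own default case mappings. The classic example is Turkish, where dotted i pairs with İ (U+0130) and dotless ı (U+0131) pairs with I. A small Python illustration (again Python rather than D, just for brevity) of why casing can't be done correctly at the code-point level without locale rules:

```python
# Unicode's default (locale-independent) case mapping folds "i" to "I",
# which is right for English but wrong for Turkish, where "i" uppercases
# to "İ" (U+0130) and "I" lowercases to dotless "ı" (U+0131).
print("i".upper())  # "I"  -- correct for English, incorrect for Turkish

# Python's str methods apply only the default mapping, so Turkish text
# needs locale-aware tailoring layered on top of Unicode itself.
print("I".lower())  # "i"  -- Turkish expects "ı" (U+0131)

# Even the default mapping isn't one code point to one code point:
# lowercasing "İ" yields "i" followed by U+0307 COMBINING DOT ABOVE.
print(len("\u0130".lower()))  # 2
```

Which is exactly the point: case relationships are locale rules riding on top of the character identities, not properties of the code points themselves.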

- Jonathan M Davis

