The Case Against Autodecode

docandrew via Digitalmars-d digitalmars-d at puremagic.com
Sun Jun 5 11:35:14 PDT 2016


On Saturday, 4 June 2016 at 08:12:47 UTC, Walter Bright wrote:
> On 6/3/2016 11:17 PM, H. S. Teoh via Digitalmars-d wrote:
>> On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via 
>> Digitalmars-d wrote:
>>> It works for books.
>> Because books don't allow their readers to change the font.
>
> Unicode is not the font.
>
>
>> This madness already exists *without* Unicode. If you have a
>> page with a single glyph 'm' printed on it and show it to an
>> English speaker, he will say it's lowercase M. Show it to a
>> Russian speaker, and he will say it's lowercase Т. So which
>> letter is it, M or Т?
>
> It's not a problem that Unicode can solve. As you said, the 
> meaning is in the context. Unicode has no context, and tries to 
> solve something it cannot.
>
> ('m' doesn't always mean m in english, either. It depends on 
> the context.)
>
> Ya know, if Unicode actually solved these problems, you'd have 
> a case. But it doesn't, and so you don't :-)
>
>
>> If you're going to represent both languages, you cannot get
>> away from needing to represent letters abstractly, rather than
>> visually.
>
> Books do visually just fine!
>
>
>> So should O and 0 share the same glyph or not? They're
>> visually the same thing,
>
> No, they're not. Not even on old typewriters where every key 
> was expensive. Even without the slash, the O tends to be fatter 
> than the 0.
>
>
>> The very fact that we distinguish between O and 0,
>> independently of what Unicode did/does, is already proof
>> enough that going by visual representation is inadequate.
>
> Except that you right now are using a font where they are 
> different enough that you have no trouble at all distinguishing 
> them without bothering to look it up. And so am I.
>
>
>> In other words toUpper and toLower does not belong in the
>> standard library. Great.
>
> Unicode and the standard library are two different things.

Even when characters in different languages share a glyph or look 
identical, it still makes sense to encode them as separate code 
points.

A simple function like isCyrillicLetter() can then be a pair of 
less-than / greater-than comparisons instead of a lookup table 
chasing code points scattered throughout the Unicode charts. 
Functions like toUpper and toLower become easier to write as well 
(for SOME languages, anyhow): it's simply myletter +/- the number 
of letters in the alphabet. Redundancy here is very helpful. A 
sketch follows below.
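To make that concrete, here is a minimal D sketch under those 
assumptions. The helper names isBasicCyrillicLetter and 
toUpperBasicCyrillic are mine, and it covers only the contiguous 
А..Я / а..я block (U+0410..U+044F), deliberately ignoring Ё/ё and 
the historic letters:

import std.stdio;

// Basic Cyrillic letters occupy one contiguous block:
//   А..Я = U+0410..U+042F (uppercase)
//   а..я = U+0430..U+044F (lowercase), exactly 0x20 above the uppercase run.
bool isBasicCyrillicLetter(dchar c)
{
    return c >= '\u0410' && c <= '\u044F';
}

dchar toUpperBasicCyrillic(dchar c)
{
    // Lowercase maps to uppercase by subtracting the fixed block offset.
    return (c >= '\u0430' && c <= '\u044F') ? cast(dchar)(c - 0x20) : c;
}

void main()
{
    writeln(isBasicCyrillicLetter('б')); // true
    writeln(toUpperBasicCyrillic('б'));  // Б
    writeln(toUpperBasicCyrillic('x'));  // x (untouched, not in the block)
}

The arithmetic only works because that block happens to be 
contiguous with a fixed case offset of 0x20; Ё (U+0401) / ё 
(U+0451) already sit outside it, which is the usual caveat with 
this kind of shortcut.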

Maybe instead of Unicode they should have called it Babel... :)

"The Lord said, “If as one people speaking the same language they 
have begun to do this, then nothing they plan to do will be 
impossible for them. Come, let us go down and confuse their 
language so they will not understand each other.”"

-Jon

