Fix Phobos dependencies on autodecoding

H. S. Teoh hsteoh at quickfur.ath.cx
Fri Aug 16 17:42:42 UTC 2019


On Fri, Aug 16, 2019 at 10:01:57AM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
> How do you reconcile these two things:
> 
> (1) The encoding of a character should not be font-dependent. I.e., it
>     should encode the abstract "symbol" rather than the physical
>     rendering of said symbol.
> 
> (2) In the real world, there exist different symbols that share the same
>     glyph shape.
[...]

Or, to use a different example that stems from the same underlying issue,
let's say we take a Russian string:

	Я тебя люблю.

In a cursive font, it might look something like this:

	Я mеδя ∧юδ∧ю.

(I'm deliberately substituting various divergent Unicode characters to
make a point.)

According to your proposal, the upright т and its cursive form (which
looks like m) ought to be encoded differently. That means Cyrillic
lowercase т has *two* different encodings (and ditto for the other
lookalikes).  This is obviously absurd, because
it's the SAME LETTER in Cyrillic.  Insisting that they be encoded
differently means your string encoding depends on font, which is in
itself already ridiculous, and worse yet, it means that if you're
writing a web script that accepts input from users, you have no idea
which encoding they will use when they want to write Cyrillic lowercase
т.  You end up with two strings that are logically identical, but
bitwise different because the user happened to have a font where т is
displayed as m.  Goodbye, sane substring search, goodbye sane automatic
string processing, goodbye, consistent string rendering code.
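
To make the breakage concrete, here is a minimal Python sketch (Python
only because it is convenient for poking at code points; nothing here is
Phobos code).  It assumes the lookalike substitutions used in the
example above (т -> m, б -> δ, л -> ∧); the variable names are made up
for illustration.

```python
# The sentence as a user would normally type it, in ordinary Cyrillic.
original = "Я тебя люблю."

# Hypothetical "encode what the cursive font shows" variant: swap three
# Cyrillic letters for the lookalike characters used in the post.
glyph_swaps = str.maketrans({"т": "m", "б": "δ", "л": "∧"})
lookalike = original.translate(glyph_swaps)

# Same number of characters, same appearance in the right font,
# but the two strings no longer compare equal.
print(len(original) == len(lookalike))
print(original == lookalike)

# A plain substring search for the Cyrillic word now misses the
# glyph-encoded string entirely (find returns -1 on a miss).
print(original.find("тебя"))
print(lookalike.find("тебя"))

# The stored bytes differ as well.
print(original.encode("utf-8") == lookalike.encode("utf-8"))
```

Any code that compares, searches, or hashes such strings would have to
know about every lookalike pair in advance, which is exactly the mess
described above.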

This is equivalent to saying that English capital A in serif ought to
have a different encoding from English capital A in sans serif, because
their glyph shapes are different. If you follow that route, pretty soon
you'll have a different encoding for bolded A, another encoding for
slanted A (which is different from italic A), and the combinatorial
explosion of useless redundant encodings thereof. It simply does not
make any sense.

The only sane way out of this mess is the way Unicode has taken: you
encode *not* the glyph, but the logical entity behind the glyph, i.e.,
the "symbol" as you call it, or in Unicode parlance, the abstract
character identified by a code point.
Cyrillic lowercase т is a unique entity that should correspond with
exactly one code point, notwithstanding that some of its forms are
lookalikes to Latin lowercase m.  Even if the font ultimately uses
literally the same glyph to render them, they remain distinct entities
in the encoding because they are *logically different things*.
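
As a quick check of that distinction, the following Python sketch asks
the standard unicodedata module what the two characters actually are.
The helper name describe is made up for illustration; the NFKC check is
included only to show that normalization does not fold one letter into
the other.

```python
import unicodedata

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

cyrillic_te = "\u0442"  # Cyrillic lowercase т
latin_m = "m"           # Latin lowercase m

print(describe(cyrillic_te))  # U+0442 CYRILLIC SMALL LETTER TE
print(describe(latin_m))      # U+006D LATIN SMALL LETTER M

# The two are distinct characters, and compatibility normalization
# leaves the Cyrillic letter unchanged rather than mapping it to "m".
print(cyrillic_te == latin_m)
print(unicodedata.normalize("NFKC", cyrillic_te) == cyrillic_te)
```

However a font chooses to draw т, the string still contains U+0442, so
comparison and search keep working across fonts.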

In today's age of international communications and multilingual strings,
the fact of different logical characters sharing the same rendered form
is an unavoidable, harsh reality.  You either face it and deal with it
in a sane way, or you can hold on to broken old approaches that don't
work and fade away in the rearview mirror.  Your choice. :-D


T

-- 
Без труда не выловишь и рыбку из пруда. 

