Fix Phobos dependencies on autodecoding

H. S. Teoh hsteoh at quickfur.ath.cx
Thu Aug 15 22:04:01 UTC 2019


On Thu, Aug 15, 2019 at 12:59:34PM -0700, Walter Bright via Digitalmars-d wrote:
> On 8/15/2019 12:44 PM, Jonathan M Davis wrote:
> > There should only be a single way to represent a given character.
> 
> Exactly. And two glyphs that render identically should be the same
> code point.
[...]

It's not as simple as you imagine.  Letter shapes across different
languages can look alike, but have zero correspondence with each other.
Conflating two distinct letter forms just because they happen to look
alike is the beginning of the road to madness.

First and foremost, the exact glyph shape depends on the font -- a
cursive M is a different shape from a serif upright M which is different
from a sans-serif bolded M.  They are logically the exact same
character, but they are rendered differently depending on the font.

What's the problem with that, you say?  Here's the problem: if we follow
your suggestion of identifying characters by rendered glyph, that means
a lowercase English 'u' ought to be the same character as the cursive
form of Cyrillic и (because that's how it's written in cursive).
However, non-cursive Cyrillic и is printed as и (i.e., the equivalent of
a "backwards" small-caps English N).  You cannot be seriously suggesting
that и and u should be the same character, right?!  The point is that
this changes *based on the font*; Russian speakers recognize the two
*distinct* glyphs as the SAME letter.  They also recognize that it's a
DIFFERENT letter from English u, in spite of the fact that the glyphs
are identical.
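To make the point concrete (my own illustration, not part of the original post): Unicode deliberately gives these look-alikes distinct code points, so comparison and case mapping stay within each script. A minimal Python sketch:

```python
import unicodedata

# Latin lowercase u and Cyrillic и can render identically in some
# cursive fonts, yet they are distinct code points with distinct names.
latin_u = 'u'          # U+0075
cyrillic_i = '\u0438'  # и

print(hex(ord(latin_u)))             # 0x75
print(hex(ord(cyrillic_i)))          # 0x438
print(unicodedata.name(latin_u))     # LATIN SMALL LETTER U
print(unicodedata.name(cyrillic_i))  # CYRILLIC SMALL LETTER I

# Case mapping stays within the script: uppercasing и yields И, not U.
print(cyrillic_i.upper() == '\u0418')  # True
print(latin_u == cyrillic_i)           # False
```

Had the two been conflated into one code point, `upper()` could not know whether to produce Latin U or Cyrillic И without consulting the font, which is exactly the dependency being argued against.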

This is just one of many such examples.  Yet another Cyrillic example:
lowercase cursive т is written with a glyph that, for all practical
purposes, is identical to the glyph for English 'm'.  Again, conflating
the two based on your idea is outright ridiculous.  Just because the
user changes the font should not mean that the character becomes a
different letter! (Or that the program needs to rewrite all и's into
lowercase u's!)

How a letter is rendered is a question of *font*, and I'm sure you'll
agree that it doesn't make sense to make decisions on character identity
based on which font you happen to be using.

Then take an example from Chinese: the character for "one" is, once you
strip away the stylistic embellishments (which is an issue of font, and
ought not to come into play with a character encoding), basically the
same shape as a hyphen. You cannot seriously be telling me that we
should treat the two as the same thing.
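The same check applies here (again my illustration, not from the post): the CJK ideograph for "one" and the ASCII hyphen-minus are similar horizontal strokes on screen, but they occupy entirely unrelated code points:

```python
# 一 (CJK "one") vs. ASCII hyphen-minus: look-alike strokes,
# unrelated code points.
one = '\u4e00'   # 一
hyphen = '-'     # U+002D

print(hex(ord(one)))     # 0x4e00
print(hex(ord(hyphen)))  # 0x2d
print(one == hyphen)     # False
```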

Basically, there is no sane way to avoid detaching the character
encoding from the physical appearance of the character.  It simply makes
no sense to have a different character for every variation of glyph
across a set of fonts.  You *have* to work on a more abstract level, at
the level of the *logical* identity of the character, not its specific
physical appearance per font.

But that *inevitably* means you'll end up with multiple distinct
characters that happen to share the same glyph (again, modulo which font
the user selected for displaying the text).  See the Cyrillic examples
above.  There are many other examples of logically-distinct characters
from different languages that happen to share the same glyph shape with
some English letter in some cases, which you cannot possibly conflate
without ending up with nonsensical results.  You cannot eliminate
dependence on the specific font if you insist on identifying characters
by shape.  The only sane solution is to work on the abstract level,
where the same logical character (e.g., the Cyrillic letter И) can have
multiple different glyphs depending on the font (in cursive, for
example, capital И looks like English U).

But once you work at the abstract level, you cannot avoid some
logically-distinct letters coinciding in glyph shape (e.g., English
lowercase u vs. Cyrillic и).  And once you start on that slippery slope,
you're not very far from descending into the "chaos" of the current
Unicode standard -- because inevitably you'll have to make distinctions
like "lowercase Greek mu as used in mathematics" vs. "lowercase Greek mu
as used by Greeks to write their language" -- because although
historically the two were identical, over time their usage has diverged
and now there exist contexts where you have to differentiate
between the two.
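In fact Unicode already encodes exactly this mu distinction: MICRO SIGN (U+00B5) for the SI prefix and GREEK SMALL LETTER MU (U+03BC) for the letter, related only through compatibility normalization. A short Python sketch (mine, not the original author's):

```python
import unicodedata

micro = '\u00b5'  # µ MICRO SIGN, the SI prefix symbol
mu = '\u03bc'     # μ GREEK SMALL LETTER MU, the Greek letter

# Visually identical in most fonts, but distinct code points...
print(micro == mu)                    # False
print(unicodedata.name(micro))        # MICRO SIGN
print(unicodedata.name(mu))           # GREEK SMALL LETTER MU

# ...which NFKC compatibility normalization folds together,
# mapping the micro sign onto the Greek letter.
print(unicodedata.normalize('NFKC', micro) == mu)  # True
```

That is, the "chaos" is not gratuitous: the standard keeps the characters distinct for contexts that need the distinction, and provides normalization for contexts that do not.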

The fact of the matter is that human language is inherently complex (not
to mention *changes over time* -- something many people don't consider),
and no amount of cleverness is going to surmount that without producing
an inherently-complex solution.


T

-- 
Why ask rhetorical questions? -- JC
