The Case Against Autodecode
Alix Pexton via Digitalmars-d
digitalmars-d at puremagic.com
Sat Jun 4 02:45:37 PDT 2016
On 03/06/2016 20:12, Dmitry Olshansky wrote:
> On 02-Jun-2016 23:27, Walter Bright wrote:
>> I wonder what rationale there is for Unicode to have two different
>> sequences of codepoints be treated as the same. It's madness.
>
> Yeah, Unicode was not meant to be easy it seems. Or this is whatever
> happens with evolutionary design that started with "everything is a
> 16-bit character".
>
Typing as someone who has spent some time creating typefaces: having two
representations makes sense, and it didn't start with Unicode; it
started with movable type.
It is much easier for a font designer to create the two-codepoint
versions of characters for most instances, i.e. make the base letters
and the diacritics once. Then what I often do is make single-codepoint
versions of the ones I'm likely to use, but only if they need more
tweaking than the kerning options of the font format allow. I'll omit
the history lesson on how this was similar in the case of movable type.
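To make the two representations concrete in D terms, here is a minimal
sketch using Phobos's std.uni (the e-acute pair is just the usual
illustration, not something from this thread):

    import std.uni : normalize, NFC, NFD;

    void main()
    {
        string precomposed = "\u00E9";  // 'é' as a single codepoint
        string combining   = "e\u0301"; // 'e' plus a combining acute accent

        assert(precomposed != combining);                // raw sequences differ
        assert(normalize!NFC(combining) == precomposed); // equal once composed
        assert(normalize!NFD(precomposed) == combining); // equal once decomposed
    }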
Keyboards for different languages mean that a character that is a single
keystroke in one layout is two keystrokes, together or in sequence, in
another. This means that Unicode represents not only completed strings,
but also strings that are mid-composition. The ordering that it uses to
ensure that graphemes have a single canonical representation is based on
the order in which those multi-key characters are entered. I wouldn't
call it elegant, but it's not inelegant either.
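For instance, two combining marks entered in either order end up in the
same canonical sequence after normalisation; a quick D sketch (the
particular marks are mine, chosen only because they have different
combining classes):

    import std.uni : normalize, NFD;

    void main()
    {
        // One grapheme, marks entered in different orders:
        string a = "e\u0323\u0308"; // e + combining dot below + diaeresis
        string b = "e\u0308\u0323"; // e + combining diaeresis + dot below

        assert(a != b);                               // raw orders differ
        assert(normalize!NFD(a) == normalize!NFD(b)); // one canonical order
    }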
Trying to represent all sufficiently similar glyphs with the same
codepoint would lead to a layout problem: how would you order them so
that strings of any language can be sorted by their locale's rules,
without having to special-case the algorithms?
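Even without unifying similar glyphs, plain codepoint order already
diverges from what any locale expects; a small D illustration (the word
list is made up):

    import std.algorithm : sort;
    import std.stdio : writeln;

    void main()
    {
        // Codepoint (here, UTF-8 byte) order puts "été" after "zebra",
        // which no French dictionary would accept.
        string[] words = ["zebra", "été", "apple"];
        sort(words);
        writeln(words); // ["apple", "zebra", "été"]
    }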
Also consider ligatures, such as those for "ff", "fi", "ffi", "fl",
"ffl", and many, many more. Typographers create these glyphs whenever
the available kerning tools do a poor job of combining them from the
individual glyphs. From the point of view of meaning they should still
be represented as individual codepoints, but for display (electronic or
print) that sequence needs to be replaced with the single codepoint for
the ligature.
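Unicode does carry presentation forms for some of these (U+FB00 to
U+FB04), and the compatibility normalisation forms map them back to the
individual letters; a short D sketch:

    import std.uni : normalize, NFKD;

    void main()
    {
        string display = "o\uFB03ce"; // 'o' + the 'ffi' ligature + 'ce'
        assert(normalize!NFKD(display) == "office"); // plain letters again
    }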
I think that in order to understand the decisions of the Unicode
committee, one has to consider that they are trying to unify the
concerns of representing written information from two sides. One side
prioritises storage and manipulation, while the other considers
aesthetics and design workflow more important. My experience of using
Unicode from both sides gives me a different appreciation for the
difficulties of reconciling the two.
A...
P.S.
Then they started adding emojis, and I lost all faith in humanity ;)