The Case Against Autodecode

Alix Pexton via Digitalmars-d digitalmars-d at puremagic.com
Sat Jun 4 02:45:37 PDT 2016


On 03/06/2016 20:12, Dmitry Olshansky wrote:
> On 02-Jun-2016 23:27, Walter Bright wrote:

>> I wonder what rationale there is for Unicode to have two different
>> sequences of codepoints be treated as the same. It's madness.
>
> Yeah, Unicode was not meant to be easy it seems. Or this is whatever
> happens with evolutionary design that started with "everything is a
> 16-bit character".
>

Typing as someone who has spent some time creating typefaces: having two 
representations makes sense, and it didn't start with Unicode, it 
started with movable type.

It is much easier for a font designer to create the two-codepoint 
versions of characters in most instances, i.e. to make the base letters 
and the diacritics once. What I then often do is make single-codepoint 
versions of the ones I'm likely to use, but only if they need more 
tweaking than the kerning options of the font format allow. I'll omit 
the history lesson on how the same was true of movable type.
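
To make the two encodings concrete, here is a minimal sketch in D 
using Phobos's std.uni.normalize (assuming a reasonably recent Phobos):

import std.uni;
import std.stdio;

void main()
{
    string single = "\u00E9";  // precomposed 'é', one codepoint
    string pair   = "e\u0301"; // 'e' + COMBINING ACUTE ACCENT, two codepoints

    writeln(single == pair);                // false: different code units
    writeln(normalize!NFC(pair) == single); // true: canonical composition
    writeln(normalize!NFD(single) == pair); // true: canonical decomposition
}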

Keyboards for different languages mean that a character that is a 
single keystroke in one layout is two keystrokes, together or in 
sequence, in another. This means that Unicode represents not only 
completed strings, but also strings that are mid-composition. The 
ordering that it uses to ensure that graphemes have a single canonical 
representation is based on the order that those multi-key characters 
are entered. I wouldn't call it elegant, but it's not inelegant either.
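
As a small illustration of that single canonical representation (same 
std.uni assumption as above), marks attached in either order come out 
the same after normalization:

import std.uni;
import std.stdio;

void main()
{
    string a = "q\u0307\u0323"; // dot above typed first
    string b = "q\u0323\u0307"; // dot below typed first

    // Canonical ordering sorts marks by combining class (below = 220
    // sorts before above = 230), so both normalize to one sequence.
    assert(normalize!NFD(a) == normalize!NFD(b));
    writeln("canonically equivalent");
}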

Trying to represent all sufficiently similar glyphs with the same 
codepoint would lead to a layout problem: how would you order them so 
that strings of any language can be sorted by their local sorting 
rules, without having to special-case the algorithms?

Also consider ligatures, such as those for "ff", "fi", "ffi", "fl", 
"ffl" and many, many more. Typographers create these glyphs whenever 
the available kerning tools do a poor job of combining them from the 
individual glyphs. From the point of view of meaning, they should still 
be represented as individual codepoints, but for display (electronic or 
print) that sequence needs to be replaced with the single codepoint for 
the ligature.
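
Unicode does carry single codepoints for the common f-ligatures 
(U+FB00..U+FB06 among the Alphabetic Presentation Forms), and a 
compatibility normalization maps them back to the individual letters; 
a sketch, under the same std.uni assumption:

import std.uni;
import std.stdio;

void main()
{
    string fi = "\uFB01"; // 'ﬁ' LATIN SMALL LIGATURE FI, one codepoint

    // NFKD strips the presentation form back to the plain letters,
    // which is what you want for searching and sorting.
    writeln(normalize!NFKD(fi)); // prints "fi"
    assert(normalize!NFKD(fi) == "fi");
}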

I think that in order to understand the decisions of the Unicode 
committee, one has to consider that they are trying to unify the 
concerns of representing written information from two sides. One side 
prioritises storage and manipulation, while the other considers 
aesthetics and design workflow more important. My experience of using 
Unicode from both sides gives me a different appreciation for the 
difficulties of reconciling the two.

A...

P.S.

Then they started adding emojis, and I lost all faith in humanity ;)

