Dicebot on leaving D: It is anarchy driven development in all its glory.

Chris wendlec at tcd.ie
Thu Sep 6 11:19:14 UTC 2018


On Thursday, 6 September 2018 at 10:44:45 UTC, Joakim wrote:
[snip]
>
> You're not being fair here, Chris. I just saw this SO question 
> that I think exemplifies how most programmers react to Unicode:
>
> "Trying to understand the subtleties of modern Unicode is 
> making my head hurt. In particular, the distinction between 
> code points, characters, glyphs and graphemes - concepts which 
> in the simplest case, when dealing with English text using 
> ASCII characters, all have a one-to-one relationship with each 
> other - is causing me trouble.
>
> Seeing how these terms get used in documents like Matthias 
> Bynens' JavaScript has a unicode problem or Wikipedia's piece 
> on Han unification, I've gathered that these concepts are not 
> the same thing and that it's dangerous to conflate them, but 
> I'm kind of struggling to grasp what each term means.
>
> The Unicode Consortium offers a glossary to explain this stuff, 
> but it's full of "definitions" like this:
>
> Abstract Character. A unit of information used for the 
> organization, control, or representation of textual data. ...
>
> ...
>
> Character. ... (2) Synonym for abstract character. (3) The 
> basic unit of encoding for the Unicode character encoding. ...
>
> ...
>
> Glyph. (1) An abstract form that represents one or more glyph 
> images. (2) A synonym for glyph image. In displaying Unicode 
> character data, one or more glyphs may be selected to depict a 
> particular character.
>
> ...
>
> Grapheme. (1) A minimally distinctive unit of writing in the 
> context of a particular writing system. ...
>
> Most of these definitions possess the quality of sounding very 
> academic and formal, but lack the quality of meaning anything, 
> or else defer the problem of definition to yet another glossary 
> entry or section of the standard.
>
> So I seek the arcane wisdom of those more learned than I. How 
> exactly do each of these concepts differ from each other, and 
> in what circumstances would they not have a one-to-one 
> relationship with each other?"
> https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme
>
> Honestly, Unicode is a mess, and I believe we will all have to 
> dump the Unicode standard and start over one day. Until that 
> fine day, there is no neat solution to how to handle it, no 
> matter how much you'd like to think so. Also, much of the 
> complexity actually comes from the complexity of the various 
> language alphabets, so that cannot be waved away no matter what 
> standard you come up with, though Unicode certainly adds more 
> unneeded complexity on top, which is why it should be dumped.

One problem, IMO, is that they mixed the terms up: "Grapheme: A 
minimally distinctive unit of writing in the context of a 
particular writing system." In linguistics a grapheme is not 
necessarily a single character like "á" or "g". It may also be a 
combination of characters, like <sh> ("s" + "h") in English 
spelling, which maps to a single phoneme (e.g. ship, shut, 
shadow). In German this sound is written as <sch>, as in "Schiff" 
(ship), but not always: cf. the "s" in "Stange".
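
The Unicode-level side of this confusion (as opposed to the 
linguistic one) fits in a few lines. What follows is only a 
minimal sketch, written in Python because its standard library 
keeps it short; the names `precomposed`, `decomposed` and 
`describe` are made up for illustration. It shows that the same 
user-perceived character "á" can be one code point or two, and 
that the number of UTF-8 code units differs again:

```python
import unicodedata

precomposed = "\u00e1"   # "á" as a single code point (U+00E1)
decomposed = "a\u0301"   # "a" followed by U+0301 COMBINING ACUTE ACCENT


def describe(text):
    """Return (number of code points, number of UTF-8 code units)."""
    return len(text), len(text.encode("utf-8"))


# Both strings are meant to render as the same character ...
print(describe(precomposed))   # (1, 2)
print(describe(decomposed))    # (2, 3)

# ... but they only compare equal once normalized to the same form.
print(precomposed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

So "length" and "equality" already depend on which of these 
levels you pick, before any linguistic notion of grapheme enters 
the picture.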

Since Unicode is such a difficult beast to deal with, I'd say D 
(or any PL for that matter) needs, first and foremost, a clear 
policy about what the default behavior is - not ad hoc patches. 
Then maybe a strategy for how the default behavior can be turned 
on and off, say for performance reasons. One way _could_ be a 
compiler switch (-unicode or -uni or -utf8 or whatever); another, 
maybe better, would be a library solution like `ustring`.

If you need high performance and correctness checks mostly don't 
matter (web crawling, data harvesting etc.), get rid of 
autodecoding. Once you need to check for character/grapheme 
correctness (e.g. translation tools), make it available through 
something like `to!ustring`. Whichever way: be clear about it. 
But don't let the unsuspecting user use `string` and get bitten 
by it.
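
To illustrate the kind of explicit split I mean, here is a rough 
sketch, again in Python rather than D. Everything in it is 
hypothetical: `count_lines_fast` and `to_ustring` are invented 
names, and `to_ustring` only mimics the `to!ustring` idea above, 
which is a suggestion and not an existing API. The clustering is 
deliberately simplified (a base character plus any following 
combining marks); real grapheme segmentation follows Unicode 
UAX #29 and covers many more cases.

```python
import unicodedata


def count_lines_fast(raw):
    """Fast path: operate on raw UTF-8 bytes, no decoding, no validation."""
    return raw.count(b"\n")


def to_ustring(raw):
    """Explicit opt-in: decode (which validates), normalize to NFC, and
    group each base character with the combining marks that follow it.

    Simplification: this is NOT full UAX #29 grapheme segmentation.
    """
    text = unicodedata.normalize("NFC", raw.decode("utf-8"))
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return clusters


raw = "a\u0301 q\u0301\n".encode("utf-8")

print(count_lines_fast(raw))   # cheap, never looks at characters
print(len(raw))                # number of UTF-8 code units
print(len(to_ustring(raw)))    # number of (simplified) clusters
```

The point is not this particular implementation but that the 
caller has to ask for the expensive, checked view by name, while 
the cheap byte-level operations never pretend to know about 
characters.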


More information about the Digitalmars-d mailing list