Dicebot on leaving D: It is anarchy driven development in all its glory.
Chris
wendlec at tcd.ie
Thu Sep 6 11:19:14 UTC 2018
On Thursday, 6 September 2018 at 10:44:45 UTC, Joakim wrote:
[snip]
>
> You're not being fair here, Chris. I just saw this SO question
> that I think exemplifies how most programmers react to Unicode:
>
> "Trying to understand the subtleties of modern Unicode is
> making my head hurt. In particular, the distinction between
> code points, characters, glyphs and graphemes - concepts which
> in the simplest case, when dealing with English text using
> ASCII characters, all have a one-to-one relationship with each
> other - is causing me trouble.
>
> Seeing how these terms get used in documents like Matthias
> Bynens' JavaScript has a unicode problem or Wikipedia's piece
> on Han unification, I've gathered that these concepts are not
> the same thing and that it's dangerous to conflate them, but
> I'm kind of struggling to grasp what each term means.
>
> The Unicode Consortium offers a glossary to explain this stuff,
> but it's full of "definitions" like this:
>
> Abstract Character. A unit of information used for the
> organization, control, or representation of textual data. ...
>
> ...
>
> Character. ... (2) Synonym for abstract character. (3) The
> basic unit of encoding for the Unicode character encoding. ...
>
> ...
>
> Glyph. (1) An abstract form that represents one or more glyph
> images. (2) A synonym for glyph image. In displaying Unicode
> character data, one or more glyphs may be selected to depict a
> particular character.
>
> ...
>
> Grapheme. (1) A minimally distinctive unit of writing in the
> context of a particular writing system. ...
>
> Most of these definitions possess the quality of sounding very
> academic and formal, but lack the quality of meaning anything,
> or else defer the problem of definition to yet another glossary
> entry or section of the standard.
>
> So I seek the arcane wisdom of those more learned than I. How
> exactly do each of these concepts differ from each other, and
> in what circumstances would they not have a one-to-one
> relationship with each other?"
> https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme
>
> Honestly, unicode is a mess, and I believe we will all have to
> dump the Unicode standard and start over one day. Until that
> fine day, there is no neat solution to how to handle it, no
> matter how much you'd like to think so. Also, much of the
> complexity actually comes from the complexity of the various
> language alphabets, so that cannot be waved away no matter what
> standard you come up with, though Unicode certainly adds more
> unneeded complexity on top, which is why it should be dumped.

One problem, imo, is that they muddled the terminology: "Grapheme:
A minimally distinctive unit of writing in the context of a
particular writing system." In linguistics, a grapheme is not
necessarily a single character like "á" or "g". It may also be a
combination of characters, as in English spelling, where <sh>
("s" + "h") maps to a single phoneme (e.g. ship, shut, shadow).
In German this sound is usually written <sch>, as in "Schiff"
(ship), but not always (cf. the "s" in "Stange").

Since Unicode is such a difficult beast to deal with, I'd say D
(or any PL, for that matter) needs, first and foremost, a clear
policy on what the default behavior is, not ad hoc patches. Then
maybe a strategy for how that default can be turned on and off,
say for performance reasons. One way _could_ be a compiler switch
(-unicode or -uni or -utf8 or whatever), or maybe better, a
library solution like `ustring`.

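To make the library idea a bit more concrete, here is a rough
sketch in Python (not D) of what an opt-in type could look like.
Everything in it is hypothetical: the `UString` name, its methods,
and the choice to normalize to NFC on construction are my
assumptions, not an existing API. The cluster logic is only an
approximation that attaches combining marks to the preceding
character; it is not full Unicode text segmentation.

```python
import unicodedata


class UString:
    """Hypothetical opt-in wrapper; not an existing library type."""

    def __init__(self, raw: str):
        # Assumption: normalize once, up front, so comparisons are stable.
        self.text = unicodedata.normalize("NFC", raw)

    def clusters(self):
        # Approximation: a combining mark joins the previous cluster.
        out = []
        for ch in self.text:
            if out and unicodedata.combining(ch):
                out[-1] += ch
            else:
                out.append(ch)
        return out

    def __len__(self):
        # Length counts approximate user-perceived characters.
        return len(self.clusters())

    def __eq__(self, other):
        return isinstance(other, UString) and self.text == other.text

    def __hash__(self):
        return hash(self.text)
```

The point is only that the cost (normalization, cluster walking) is
paid where the user asks for it by constructing a `UString`, while
the plain string type stays cheap.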
If you need high performance and checks are mostly not an issue
(web crawling, data harvesting, etc.), get rid of autodecoding.
Once you need to check for character/grapheme correctness (e.g.
translation tools), make it available through something like
`to!ustring`. Whichever way: be clear about it. But don't let the
unsuspecting user use `string` and get bitten by it.
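
A small illustration of how the "length" of the same text depends
on the unit you count, which is where the unsuspecting user gets
bitten. This is a Python sketch rather than D; the helper name
`count_units` is made up for the example, and the cluster count is
an approximation (it only skips combining marks) rather than real
grapheme segmentation.

```python
import unicodedata


def count_units(text: str):
    """Return (UTF-8 code units, code points, approximate clusters)."""
    utf8_units = len(text.encode("utf-8"))
    code_points = len(text)
    # Approximation: every non-combining code point starts a cluster.
    clusters = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return utf8_units, code_points, clusters


decomposed = "a\u0301"   # 'a' followed by a combining acute accent
precomposed = "\u00e1"   # the single code point for the same letter

print(count_units(decomposed))   # (3, 2, 1)
print(count_units(precomposed))  # (2, 1, 1)
```

Both strings display as the same letter, yet they differ in byte
length and in code point count, and only agree at the cluster
level. Whatever the default is, it should be obvious which of the
three the user is getting.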