The Case Against Autodecode
Tobias Müller via Digitalmars-d
digitalmars-d at puremagic.com
Sun May 29 04:47:30 PDT 2016
On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:
> Unicode graphemes are not always the same as graphemes in
> natural (written) languages. If <é> is composed in Unicode, it
> is still one grapheme in a written language, not two distinct
> characters. However, in natural languages two characters can be
> one grapheme, as in English <sh>, it represents the sound in
> `shower, shop, fish`. In German the same sound is represented
> by three characters <sch> as in `Schaf` ("sheep"). A bit
> nit-picky but we should make clear that we talk about "Unicode
> graphemes" that map to single characters on the written page.
> But is that at all possible across all languages?
>
> To avoid confusion and misunderstandings we should agree on the
> terminology first.
No, this is well-established terminology. You are confusing
several things here:
- A grapheme is a "character" as written on the page
- A phoneme is a spoken "character"
- A codepoint is the fundamental "unit" of Unicode
Graphemes are built from one or more codepoints.
Phonemes are a different topic and not really covered by the
Unicode standard AFAIK. The exception is IPA notation, but IPA
symbols are again graphemes that merely represent phonemes.
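
To make the codepoint/grapheme distinction above concrete, here is a
minimal sketch in Python (chosen only because its standard library is
enough for the demonstration). It uses the <é> example from the quoted
text. The helper count_graphemes_approx is a hypothetical name, and it
is only a rough approximation: it attaches combining marks to the
preceding base character and does not implement the full Unicode
grapheme cluster rules.

```python
import unicodedata

# The same grapheme <é>, encoded two different ways.
composed = "\u00e9"     # one codepoint: U+00E9
decomposed = "e\u0301"  # two codepoints: 'e' + U+0301 COMBINING ACUTE ACCENT


def count_graphemes_approx(text):
    """Approximate grapheme count (hypothetical helper, not full segmentation).

    Counts every codepoint whose canonical combining class is 0, so each
    combining mark is folded into the base character before it.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# len() on a Python str counts codepoints, not graphemes.
print(len(composed), len(decomposed))
print(count_graphemes_approx(composed), count_graphemes_approx(decomposed))

# NFC normalization composes the two-codepoint form into the single codepoint.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Both strings are one grapheme on the page, but they differ in codepoint
count, which is why iterating by codepoint is not the same as iterating
by "character".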