The Case Against Autodecode

Sun May 29 04:47:30 PDT 2016

On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:
> Unicode graphemes are not always the same as graphemes in 
> natural (written) languages. If <é> is composed in Unicode, it 
> is still one grapheme in a written language, not two distinct 
> characters. However, in natural languages two characters can be 
> one grapheme, as in English <sh>, it represents the sound in 
> `shower, shop, fish`. In German the same sound is represented 
> by three characters <sch> as in `Schaf` ("sheep"). A bit 
> nit-picky but we should make clear that we talk about "Unicode 
> graphemes" that map to single characters on the written page. 
> But is that at all possible across all languages?
>
> To avoid confusion and misunderstandings we should agree on the 
> terminology first.

No, this is well-established terminology. You are confusing 
several things here:

- A grapheme is a "character" as written on the page
- A phoneme is a spoken "character"
- A codepoint is the fundamental "unit" of Unicode

Graphemes are built from one or more codepoints.
Phonemes are a different topic and, AFAIK, not really covered by 
the Unicode standard. The exception is IPA notation, but IPA 
symbols are again graphemes that represent phonemes.
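
To make the distinction concrete, here is a minimal sketch in D, 
assuming Phobos' std.range and std.uni. `Counts` and `countUnits` 
are names made up for this illustration, not library API. It 
counts the same text three ways: as UTF-8 code units, as 
codepoints (which is what autodecoding iterates over), and as 
Unicode graphemes.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper type: three different "lengths" of one string.
struct Counts
{
    size_t codeUnits;   // UTF-8 code units (bytes)
    size_t codePoints;  // Unicode codepoints
    size_t graphemes;   // Unicode grapheme clusters
}

/// Hypothetical helper: counts a string at all three levels.
Counts countUnits(string s)
{
    return Counts(
        s.length,                // raw array length, no decoding
        s.walkLength,            // range primitives autodecode to dchar
        s.byGrapheme.walkLength  // grapheme cluster segmentation
    );
}

unittest
{
    // Precomposed é (U+00E9): one codepoint, one grapheme.
    assert(countUnits("\u00E9") == Counts(2, 1, 1));

    // Decomposed é (e + U+0301): two codepoints, still one grapheme.
    assert(countUnits("e\u0301") == Counts(3, 2, 1));

    // English <sh>: one unit to a linguist, two Unicode graphemes.
    assert(countUnits("sh") == Counts(2, 2, 2));
}
```

Building with `-unittest -main` should run the checks. The last 
one shows the case from the quoted post: a digraph like <sh> is 
not a single grapheme in the Unicode sense.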