The Case Against Autodecode

Sun May 29 05:08:52 PDT 2016

On Sunday, 29 May 2016 at 11:47:30 UTC, Tobias Müller wrote:
> On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:
>> Unicode graphemes are not always the same as graphemes in 
>> natural (written) languages. If <é> is composed in Unicode, it 
>> is still one grapheme in a written language, not two distinct 
>> characters. However, in natural languages two characters can 
>> be one grapheme, as in English <sh>, which represents the sound 
>> in `shower, shop, fish`. In German the same sound is 
>> represented by three characters <sch> as in `Schaf` ("sheep"). 
>> A bit nit-picky but we should make clear that we talk about 
>> "Unicode graphemes" that map to single characters on the 
>> written page. But is that at all possible across all languages?
>>
>> To avoid confusion and misunderstandings we should agree on 
>> the terminology first.
>
> No, this is well established terminology, you are confusing 
> several things here:
>
> - A grapheme is a "character" as written on the page
> - A phoneme is a spoken "character"
> - A codepoint is the fundamental "unit" of Unicode
>
> Graphemes are built from one or more codepoints.
> Phonemes are a different topic and not really covered by the 
> Unicode standard AFAIK, except for the IPA notation, but those 
> are again graphemes that represent phonemes.

I am pretty sure that a single grapheme in Unicode does not 
correspond to your notion of "character". As far as I can tell, 
what you think of as a "character" is officially called a 
"grapheme cluster", not a "grapheme".

See here: http://www.unicode.org/glossary/#grapheme_cluster
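
To make the code point vs. grapheme cluster distinction concrete, here is a minimal sketch in Python (chosen only because its standard library exposes the Unicode character database). The helper `naive_clusters` is hypothetical and deliberately simplified: it only attaches combining marks to the preceding base character and does not implement the full segmentation rules from the Unicode standard.

```python
import unicodedata

composed = "\u00e9"      # <é> as one precomposed code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

def naive_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Hypothetical helper for illustration only: real grapheme cluster
    segmentation has many more rules (Hangul, emoji sequences, ...).
    """
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Code point counts differ between the two spellings of <é> ...
print(len(composed), len(decomposed))
# ... but both form a single user-perceived character.
print(len(naive_clusters(composed)), len(naive_clusters(decomposed)))
# Normalizing to NFC turns the decomposed form into the precomposed one.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Note that `Schaf` still comes out as five clusters here: the <sch> from the linguistic sense of "grapheme" is not something Unicode segmentation knows about.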