The Case Against Autodecode
default0 via Digitalmars-d
digitalmars-d at puremagic.com
Sun May 29 05:08:52 PDT 2016
On Sunday, 29 May 2016 at 11:47:30 UTC, Tobias Müller wrote:
> On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:
>> Unicode graphemes are not always the same as graphemes in
>> natural (written) languages. If <é> is composed in Unicode, it
>> is still one grapheme in a written language, not two distinct
>> characters. However, in natural languages two characters can
>> be one grapheme, as in English <sh>, which represents the sound
>> in `shower, shop, fish`. In German the same sound is
>> represented by three characters <sch> as in `Schaf` ("sheep").
>> A bit nit-picky but we should make clear that we talk about
>> "Unicode graphemes" that map to single characters on the
>> written page. But is that at all possible across all languages?
>>
>> To avoid confusion and misunderstandings we should agree on
>> the terminology first.
>
> No, this is well established terminology, you are confusing
> several things here:
>
> - A grapheme is a "character" as written on the page
> - A phoneme is a spoken "character"
> - A codepoint is the fundamental "unit" of Unicode
>
> Graphemes are built from one or more codepoints.
> Phonemes are a different topic and not really covered by the
> Unicode standard AFAIK, except for the IPA notation, but these
> are again graphemes that represent phonemes.
I am pretty sure that a single grapheme in Unicode does not
correspond to your notion of "character". What you think of as a
"character" is officially called a "grapheme cluster", not a
"grapheme".
See here: http://www.unicode.org/glossary/#grapheme_cluster
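To make the codepoint vs. grapheme cluster distinction concrete, here is a small sketch in Python (picked only because its standard library exposes the Unicode character database). The helper `naive_clusters` is a hypothetical name of mine, and it is a deliberate simplification: it only attaches combining marks to the preceding base codepoint. Real grapheme cluster segmentation follows the rules in UAX #29 and handles many more cases.

```python
import unicodedata

composed = "\u00e9"      # <é> as a single codepoint, U+00E9
decomposed = "e\u0301"   # <e> followed by U+0301 COMBINING ACUTE ACCENT


def naive_clusters(text):
    """Group each base codepoint with the combining marks that follow it.

    Simplification for illustration only; this is NOT the full
    UAX #29 grapheme cluster algorithm.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


print(len(composed))                    # 1 codepoint
print(len(decomposed))                  # 2 codepoints
print(len(naive_clusters(decomposed)))  # 1 cluster
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Both strings show up as one <é> on the page, but only the cluster count agrees with that; the codepoint count depends on how the text happens to be encoded.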