The Case Against Autodecode

Chris via Digitalmars-d digitalmars-d at puremagic.com
Sun May 29 05:41:50 PDT 2016


On Sunday, 29 May 2016 at 11:47:30 UTC, Tobias Müller wrote:
> On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:
>> Unicode graphemes are not always the same as graphemes in 
>> natural (written) languages. If <é> is composed in Unicode, it 
>> is still one grapheme in a written language, not two distinct 
>> characters. However, in natural languages two characters can 
>> be one grapheme, as in English <sh>, it represents the sound 
>> in `shower, shop, fish`. In German the same sound is 
>> represented by three characters <sch> as in `Schaf` ("sheep"). 
>> A bit nit-picky but we should make clear that we talk about 
>> "Unicode graphemes" that map to single characters on the 
>> written page. But is that at all possible across all languages?
>>
>> To avoid confusion and misunderstandings we should agree on 
>> the terminology first.
>
> No, this is well established terminology, you are confusing 
> several things here:
>
> - A grapheme is a "character" as written on the page
> - A phoneme is a spoken "character"
> - A codepoint is the fundamental "unit" of unicode
>
> Graphemes are built from one or more codepoints.
> Phonemes are a different topic and not really covered by the 
> unicode standard AFAIK. Except for the IPA notation, but these 
> are again graphemes that represent phonemes.

OK, you have a point there. To be precise, <sh> is a multigraph 
(a digraph, cf. [1]). Multigraphs can also consist of three or 
more characters, as in French <eau> => /o/ or Irish <aoi> => 
/i:/. However, a phoneme is not necessarily a spoken "character": 
<sh> represents one phoneme but consists of two "characters" or 
graphemes, and <th> can represent two different phonemes (voiced 
and unvoiced "th", as in `this` vs. `thorough`).

My point was that we have to be _very_ careful not to conflate 
our cultural experience of written text with machine 
representations. There's bound to be confusion otherwise. That's 
why we should always make clear what we are referring to when we 
use the words grapheme, character, code point, etc.
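
To make the distinction concrete, here is a minimal sketch of how 
the counts diverge for <é> and <sh>. It is written in Python 
rather than D purely for illustration, `describe` is a 
hypothetical helper name, and the grapheme count is only a rough 
approximation (it skips combining marks instead of implementing 
the full Unicode segmentation rules):

```python
import unicodedata

precomposed = "\u00e9"   # <é> as one code point (U+00E9)
decomposed = "e\u0301"   # <e> followed by COMBINING ACUTE ACCENT (U+0301)


def describe(s):
    """Return (UTF-8 code units, code points, approximate graphemes)."""
    code_units = len(s.encode("utf-8"))
    code_points = len(s)
    # Rough approximation only: count the code points that are not
    # combining marks. Real grapheme segmentation follows Unicode UAX #29.
    graphemes = sum(1 for c in s if unicodedata.combining(c) == 0)
    return code_units, code_points, graphemes


print(describe(precomposed))  # (2, 1, 1)
print(describe(decomposed))   # (3, 2, 1)
print(describe("sh"))         # (2, 2, 2)

# The two spellings of <é> differ as code point sequences, but
# NFC normalization maps the decomposed form to the precomposed one.
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Both spellings of <é> come out as one "Unicode grapheme", while 
the digraph <sh> comes out as two, even though a linguist would 
treat it as a single unit. That is the kind of mismatch I mean.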

[1] https://en.wikipedia.org/wiki/Grapheme

