The Case Against Autodecode

tsbockman via Digitalmars-d digitalmars-d at puremagic.com
Fri May 27 15:04:34 PDT 2016


On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:
> On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
>> No, this is not the point of normalization.
>
> What is? -- Andrei

1) A grapheme may include several combining characters (such as 
diacritics) whose order is not supposed to be semantically 
significant. Normalization sorts them into a standardized order, 
so that string comparisons return the expected result for 
graphemes that differ only in the internal ordering of their 
constituent combining code points.
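
For example, something like this (a rough, untested sketch 
using Phobos' std.uni.normalize; the particular marks are just 
an illustration):

import std.uni : normalize, NFC;

void main()
{
    // "q" followed by dot above (U+0307) and dot below (U+0323),
    // attached in either order:
    string a = "q\u0307\u0323";
    string b = "q\u0323\u0307";

    assert(a != b); // raw code units differ
    // Normalization sorts the marks into canonical order
    // (dot below, then dot above), so the strings compare equal:
    assert(normalize!NFC(a) == normalize!NFC(b));
    assert(normalize!NFC(a) == "q\u0323\u0307");
}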

2) Some graphemes (like accented Latin letters) can be 
represented either by a single precomposed code point, or by a 
base letter followed by a combining diacritic. Normalization 
either splits them all apart (NFD), or combines them wherever 
possible (NFC). Again, this is primarily intended to make things 
like string comparisons work as expected, and perhaps to 
simplify low-level tasks like graphical rendering of text.

(Disclaimer: This is an oversimplification, because nothing about 
Unicode is ever simple.)


