The Case Against Autodecode

Minas Mina via Digitalmars-d digitalmars-d at puremagic.com
Fri May 27 15:12:57 PDT 2016


On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:
> On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
>> On 27-May-2016 21:11, Andrei Alexandrescu wrote:
>>> On 5/27/16 10:15 AM, Chris wrote:
>>>> It has happened to me that characters like "é" return length 
>>>> == 2
>>>
>>> Would normalization make length 1? -- Andrei
>>
>> No, this is not the point of normalization.
>
> What is? -- Andrei

Here is an example about normalization.

In Unicode, the grapheme Ä can be written as two code points: A 
(the ASCII A) followed by the combining diaeresis ¨ (U+0308).

However, one of the goals of Unicode was to be backwards 
compatible with earlier encodings that extended ASCII (codepages).
In some codepages, Ä was an actual codepoint.

So in some cases you would have the decomposed Unicode one, which 
is two codepoints, and the precomposed one carried over from the 
codepages (U+00C4), which is one.

Those should be the same though, i.e. compare the same. In order 
to do that, there is normalization. What it does, in its 
decomposing form (NFD), is to _expand_ the single codepoint Ä into 
A + ¨. The composing form (NFC) goes the other way.
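
To make that concrete, here is a small sketch. It is in Python 
rather than D only because Python's standard unicodedata module 
makes it self-contained; the helper name same_text is made up for 
this example and is not part of any library.

```python
import unicodedata

precomposed = "\u00C4"   # Ä as a single code point (the codepage legacy)
decomposed = "A\u0308"   # ASCII A followed by COMBINING DIAERESIS


def same_text(a: str, b: str, form: str = "NFD") -> bool:
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Python's len() counts code points, so the two spellings differ in length.
print(len(precomposed), len(decomposed))

# A raw comparison sees two different code point sequences.
print(precomposed == decomposed)

# After normalization (either form) they compare the same.
print(same_text(precomposed, decomposed, "NFD"))
print(same_text(precomposed, decomposed, "NFC"))

# NFD expands the single code point into two; NFC composes the pair into one.
print(len(unicodedata.normalize("NFD", precomposed)))
print(len(unicodedata.normalize("NFC", decomposed)))

# In UTF-8 the precomposed form still takes two code units (bytes).
print(len(precomposed.encode("utf-8")))
```

So normalization is about making equivalent sequences compare 
equal, not about making length come out as 1. Even the single 
codepoint U+00C4 is two code units in UTF-8, which is what 
.length on a D string counts.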
