Inconsistency

monarch_dodra monarchdodra at gmail.com
Wed Oct 16 11:13:35 PDT 2013


On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg wrote:
> On 2013-10-16 14:33, qznc wrote:
>
>> It is either [U+00E4] as one code point or [a,U+0308] for two
>> code points. The second is "combining diaeresis" [0]. Not
>> required, but possible. Those combining characters [1] provide a
>> nearly infinite number of combinations. You can go crazy with it:
>> http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work
>>
>> [0] http://www.fileformat.info/info/unicode/char/0308/index.htm
>> [1] http://en.wikipedia.org/wiki/Combining_character
>
> Aha, now I see.

One interesting point: take "ba\u00E4r" vs "baa\u0308r" and run a 
replace that turns 'a' into 'o'. You then get "boär" vs "boör".

Which is the correct behavior? There is no correct answer.
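
As a rough illustration of the replace example (in Python rather 
than D, purely as a sketch; the variable names are made up for 
this example), a code-point-level replace treats the two spellings 
differently:

```python
import unicodedata

precomposed = "ba\u00E4r"   # b, a, U+00E4 (a with diaeresis, one code point), r
decomposed = "baa\u0308r"   # b, a, a, U+0308 (combining diaeresis), r

# str.replace works on code points, so it cannot tell a bare 'a'
# from an 'a' that is the base of a following combining mark.
replaced_pre = precomposed.replace("a", "o")
replaced_dec = decomposed.replace("a", "o")

print(replaced_pre)   # second letter replaced, U+00E4 left alone
print(replaced_dec)   # both 'a' code points replaced, U+0308 now sits on an 'o'

# Normalizing to NFC before replacing makes both inputs behave the
# same way. That is one possible policy, not "the" correct one.
normalized = unicodedata.normalize("NFC", decomposed)
replaced_nfc = normalized.replace("a", "o")
```

Whether the diaeresis-carrying 'a' should count as an 'a' is 
exactly the question that has no single right answer; normalizing 
first only makes the behavior consistent.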

So while a combining mark should never be separated from its base 
letter (e.g., sorting "oäa" should *not* generate "aaö"; what it 
*should* generate is up for debate), you can't entirely treat 
letter + combining mark as a single entity either.
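
The sorting case can be sketched the same way (again Python, not 
D). The `clusters` helper below is hypothetical and deliberately 
simplistic: it only glues combining marks onto the preceding 
character, which is an approximation and not the full Unicode 
grapheme cluster algorithm. The sort is plain code-point order, 
not a locale-aware collation.

```python
import unicodedata

def clusters(text):
    """Group each character with the combining marks that follow it."""
    groups = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

word = "oa\u0308a"   # o, a + combining diaeresis, a

# Sorting raw code points moves U+0308 away from its 'a' and onto the 'o'.
naive = "".join(sorted(word))

# Sorting whole clusters keeps the mark attached to its base letter.
grouped = "".join(sorted(clusters(word)))

print(naive)
print(grouped)
```

Keeping the mark attached avoids the corrupted result, but where 
the accented letter belongs relative to the plain one remains a 
collation decision.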

Long story short: unicode is f***ing complicated.

And I think D does a *damn* fine job of supporting it. In 
particular, it does an awesome job of *teaching* the coder *what* 
Unicode is. Virtually everyone here has solid knowledge of 
Unicode (I feel). They understand it, can explain it, and can 
work with it.

On the other hand, I don't know many C++ coders who understand 
Unicode.

