Inconsistency
monarch_dodra
monarchdodra at gmail.com
Wed Oct 16 11:13:35 PDT 2013
On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg
wrote:
> On 2013-10-16 14:33, qznc wrote:
>
>> It is either [U+00E4] as one code point or [a,U+0308] for two code
>> points. The second is "combining diaeresis" [0]. Not required, but
>> possible. Those combining characters [1] provide a nearly infinite
>> number of combinations. You can go crazy with it:
>> http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work
>>
>> [0] http://www.fileformat.info/info/unicode/char/0308/index.htm
>> [1] http://en.wikipedia.org/wiki/Combining_character
>
> Aha, now I see.
One interesting point: take "ba\u00E4r" vs "baa\u0308r" and run a
replace of 'a' with 'o'. You'll get "boär" vs "boör".
Which is the correct behavior? There is no correct answer.
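To make the difference concrete, here is a minimal sketch of that replace. It is written in Python rather than D, purely for illustration, and the variable names are made up for this example. It only shows what a code-point-level replace does to the two spellings; it does not claim that either result is the "right" one.

```python
import unicodedata

precomposed = "ba\u00E4r"  # 'ä' stored as the single code point U+00E4
decomposed = "baa\u0308r"  # 'a' followed by U+0308 COMBINING DIAERESIS

# A plain replace works on code points, so it only sees bare 'a' code points.
replaced_precomposed = precomposed.replace("a", "o")
replaced_decomposed = decomposed.replace("a", "o")

# ascii() makes the underlying code points visible.
print(ascii(replaced_precomposed))  # 'bo\xe4r'    -> renders as "boär"
print(ascii(replaced_decomposed))   # 'boo\u0308r' -> renders as "boör"

# Normalizing to NFC first folds the decomposed spelling into the
# precomposed one, so both inputs would then be treated alike.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
print(same_after_nfc)  # True
```

Normalizing before the replace makes the two inputs consistent with each other, but it is a policy choice, not an answer to which behavior is correct.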
So while a combining mark should never be separated from its base
letter (e.g., sorting "oäa" should *not* generate "aaö"; what it
*should* generate is up for debate), you can't entirely treat
letter + combining mark as a single entity.
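The sorting case can be sketched the same way (again in Python, for illustration only). The helper `clusters` is a hypothetical name, and it is a deliberate simplification: it just glues combining marks onto the preceding character, whereas real grapheme segmentation has more rules. The ordering it produces is one possible outcome, not a claim about what the result should be.

```python
import unicodedata

def clusters(text):
    """Group each character with the combining marks that follow it.

    Simplified assumption: a cluster is a base character plus any
    directly following characters with a non-zero combining class.
    """
    result = []
    for ch in text:
        if result and unicodedata.combining(ch):
            result[-1] += ch  # attach the mark to the previous base
        else:
            result.append(ch)
    return result

word = "oa\u0308a"  # "oäa" with the 'ä' in decomposed form

# Sorting raw code points moves U+0308 to the end, behind the 'o'.
naive = "".join(sorted(word))

# Sorting clusters keeps the diaeresis attached to its 'a'.
grouped = "".join(sorted(clusters(word)))

print(ascii(naive))    # 'aao\u0308' -> renders as "aaö"
print(ascii(grouped))  # 'aa\u0308o' -> renders as "aäo"
```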
Long story short: Unicode is f***ing complicated.
And I think D does a *damn* fine job of supporting it. In
particular, it does an awesome job of *teaching* the coder *what*
Unicode is. Virtually everyone here has solid knowledge of
Unicode (I feel). They understand it, can explain it, and can
work with it.
On the other hand, I don't know many C++ coders who understand
Unicode.