Inconsistency

qznc qznc at web.de
Wed Oct 16 12:42:57 PDT 2013


On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra 
wrote:
> On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg 
> wrote:
>> On 2013-10-16 14:33, qznc wrote:
>>
>>> It is either [U+00E4] as one code point or [a,U+0308] for two code
>>> points. The second is "combining diaeresis" [0]. Not required, but
>>> possible. Those combining characters [1] provide a nearly infinite
>>> number of combinations. You can go crazy with it:
>>> http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work
>>>
>>> [0] http://www.fileformat.info/info/unicode/char/0308/index.htm
>>> [1] http://en.wikipedia.org/wiki/Combining_character
>>
>> Aha, now I see.
>
> One interesting point: take "ba\u00E4r" vs "baa\u0308r" and run a
> replace of 'a' with 'o'. You then get "boär" vs "boör".
>
> Which is the correct behavior? There is no correct answer.
>
> So while a grapheme should never be separated from its 
> "letter" (e.g., sorting "oäa" should *not* generate "aaö"; what 
> it *should* generate is up for debate), you can't entirely 
> consider that a letter+grapheme is a single entity.
>
> Long story short: unicode is f***ing complicated.
>
> And I think D does a *damn* fine job of supporting it. In 
> particular, it does an awesome job of *teaching* the coder 
> *what* unicode is. Virtually everyone here has solid knowledge 
> of unicode (I feel). They understand it, can explain it, and 
> can work with it.
>
> On the other hand, I don't know many C++ coders that understand 
> unicode.

I agree with your point. Nevertheless, your understanding of 
"grapheme" is off. U+0308 on its own is a combining character (a 
single code point), not a grapheme. "a\u0308" is one grapheme, 
and U+00E4 is the same grapheme as "a\u0308".

http://en.wikipedia.org/wiki/Grapheme
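
To make the distinction concrete, here is a minimal sketch. It uses
Python only because its standard unicodedata module keeps the checks
short; it is an illustration of the Unicode behaviour discussed above,
not a description of how D's std.uni handles it. The variable names are
made up for the example.

```python
import unicodedata

precomposed = "ba\u00E4r"  # 'ä' stored as the single code point U+00E4
decomposed = "baa\u0308r"  # 'ä' stored as 'a' followed by U+0308

# U+0308 is a combining mark: its canonical combining class is non-zero,
# while a base letter such as 'a' has class 0.
is_combining = unicodedata.combining("\u0308") != 0
base_is_combining = unicodedata.combining("a") != 0

# The two strings differ as code point sequences (4 vs 5 code points) ...
same_code_points = precomposed == decomposed

# ... but NFC normalization composes 'a' + U+0308 into U+00E4, so they
# are canonically equivalent, i.e. they spell the same graphemes.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# A code-point-level replace of 'a' with 'o' treats them differently:
replaced_precomposed = precomposed.replace("a", "o")  # U+00E4 is untouched
replaced_decomposed = decomposed.replace("a", "o")    # the base 'a' changes too

# Sorting by code point detaches the combining mark from its base letter:
# 'o', 'a', U+0308, 'a' becomes 'a', 'a', 'o', U+0308.
sorted_code_points = "".join(sorted("oa\u0308a"))
```

Normalizing both inputs to the same form (NFC here) before replacing or
comparing at least makes the two spellings behave identically; whether
'a' inside "ä" *should* match a search for 'a' is still a policy
decision.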

