Inconsistency

Dmitry Olshansky dmitry.olsh at gmail.com
Wed Oct 16 12:47:16 PDT 2013


16-Oct-2013 23:42, qznc wrote:
> On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra wrote:
>> On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg wrote:
>>> On 2013-10-16 14:33, qznc wrote:
>>>
>>>> It is either [U+00E4] as one code point or [a,U+0308] for two code
>>>> points. The second is "combining diaeresis" [0]. Not required, but
>>>> possible. Those combining characters [1] provide a nearly infinite
>>>> number of combinations. You can go crazy with it:
>>>> http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work
>>>>
>>>> [0] http://www.fileformat.info/info/unicode/char/0308/index.htm
>>>> [1] http://en.wikipedia.org/wiki/Combining_character
>>>
>>> Aha, now I see.
>>
>> One interesting point: with "ba\u00E4r" vs "baa\u0308r",
>> you can run a replace to replace 'a' with 'o'. Then, you'll get:
>> "boär" vs "boör"
>>
>> Which is the correct behavior? There is no correct answer.
>>
>> So while a grapheme should never be separated from its "letter" (e.g.,
>> sorting "oäa" should *not* generate "aaö". What it *should* generate
>> is up to debate), you can't entirely consider that a letter+grapheme
>> is a single entity.
>>
>> Long story short: unicode is f***ing complicated.
>>
>> And I think D does a *damn* fine job of supporting it. In particular,
>> it does an awesome job of *teaching* the coder *what* unicode is.
>> Virtually everyone here has solid knowledge of unicode (I feel). They
>> understand it, can explain it, and can work with it.
>>
>> On the other hand, I don't know many C++ coders that understand unicode.
>
> I agree with your point. Nevertheless, your understanding of grapheme is
> off. U+0308 is not a grapheme. "a\u0308" is one grapheme. U+00e4 is the
> same grapheme as "a\u0308".

s/the same/canonically equivalent/ :)
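
To make the distinction concrete, here is a minimal sketch. It is in Python rather than D, only because the standard unicodedata module keeps the check short; the variable names are mine, purely for illustration. It shows that the two spellings are different code point sequences, that they compare equal once both are normalized (NFC or NFD), and that a naive replace of 'a' with 'o' treats them differently, as in monarch_dodra's example quoted above.

```python
import unicodedata

# Two spellings of the same text.
precomposed = "ba\u00E4r"   # 'ä' as the single code point U+00E4
decomposed = "baa\u0308r"   # 'a' followed by U+0308 COMBINING DIAERESIS

# Different code point sequences, so a plain comparison says "not equal".
plain_equal = precomposed == decomposed

# Canonically equivalent: both normalize to the same sequence.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", decomposed))

# A code-point-level replace does not know about combining marks:
# in the decomposed form the base letter under the diaeresis is replaced too.
replaced_precomposed = precomposed.replace("a", "o")  # b, o, U+00E4, r
replaced_decomposed = decomposed.replace("a", "o")    # b, o, o, U+0308, r

print(plain_equal, nfc_equal, nfd_equal)
print(replaced_precomposed, replaced_decomposed)
```

Normalizing both inputs to the same form before the replace would at least make the two results agree; which result is "correct" is still a matter of what the caller wants.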

>
> http://en.wikipedia.org/wiki/Grapheme


-- 
Dmitry Olshansky

