Unicode handling comparison
Jakob Ovrum
jakobovrum at gmail.com
Wed Nov 27 07:43:10 PST 2013
On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
> Through Reddit I have seen this small comparison of Unicode
> handling between different programming languages:
>
> http://mortoray.com/2013/11/27/the-string-type-is-broken/
Most of the points are good, but the author seems to confuse
UCS-2 with UTF-16, so the whole point about UTF-16 is plain wrong.
The author also doesn't seem to understand the Unicode
definitions of character and grapheme, which is a shame, because
the difference is more or less the whole point of the post.
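
To make the UCS-2/UTF-16 distinction concrete, here is a small sketch (not from the linked article; the choice of U+1D11E is only an illustrative assumption). UTF-16 encodes code points above U+FFFF as a surrogate pair of two code units, which fixed-width UCS-2 cannot represent:

```d
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
    wstring s = "\U0001D11E"w;

    writeln(s.length);     // 2: UTF-16 code units (one surrogate pair)
    writeln(s.walkLength); // 1: code points

    // The two code units are the high and low surrogates.
    assert(s[0] == 0xD834 && s[1] == 0xDD1E);
}
```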
> D+Phobos seem to fail most things (it produces BAFFLE):
> http://dpaste.dzfl.pl/a5268c435
D strings are arrays of code units and ranges of code points. The
failure here is yours, in that you didn't use std.uni to handle
graphemes.
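
As a minimal sketch of the two views (the sample word is an assumption, written in decomposed form so that the 'ë' takes two code points), `length` reports code units while range primitives such as `walkLength` count decoded code points:

```d
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    // 'e' followed by U+0308 COMBINING DIAERESIS, i.e. "noël" decomposed.
    string s = "noe\u0308l";

    writeln(s.length);     // 6: UTF-8 code units (array view)
    writeln(s.walkLength); // 5: code points (range view, decoded on the fly)
}
```

Neither number is the four user-perceived characters of the word; that count needs grapheme handling from std.uni.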
On that note, I tried to use std.uni to write a simple example of
how to correctly handle this in D, but it became apparent that
std.uni should expose something like `byGrapheme` which lazily
transforms a range of code points to a range of graphemes
(probably needs a `byCodePoint` to do the converse too). The two
extant grapheme functions, `decodeGrapheme` and `graphemeStride`,
are *awful* for string manipulation (granted, they are probably
perfect for text rendering).
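
For illustration only, here is a rough sketch of what grapheme-aware manipulation looks like using just `graphemeStride`. The helper names `countGraphemes` and `reverseGraphemes` are hypothetical, not part of Phobos, and the manual index bookkeeping is the awkward part:

```d
import std.stdio : writeln;
import std.uni : graphemeStride;

// Counts grapheme clusters by repeatedly asking how many code units
// the cluster starting at index i occupies.
size_t countGraphemes(string s)
{
    size_t n = 0;
    for (size_t i = 0; i < s.length; i += graphemeStride(s, i))
        ++n;
    return n;
}

// Reverses text cluster by cluster, so that combining marks stay
// attached to their base character.
string reverseGraphemes(string s)
{
    string result;
    while (s.length != 0)
    {
        immutable len = graphemeStride(s, 0);
        result = s[0 .. len] ~ result; // prepend the whole cluster
        s = s[len .. $];
    }
    return result;
}

void main()
{
    string s = "noe\u0308l"; // "noël" with a combining diaeresis

    writeln(countGraphemes(s));   // 4
    writeln(reverseGraphemes(s)); // "lëon", diaeresis still on the 'e'
}
```

A lazy range of graphemes would let generic range algorithms do this work instead of hand-written loops.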