Unicode handling comparison

Jakob Ovrum jakobovrum at gmail.com
Wed Nov 27 07:43:10 PST 2013


On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
> Through Reddit I have seen this small comparison of Unicode 
> handling between different programming languages:
>
> http://mortoray.com/2013/11/27/the-string-type-is-broken/

Most of the points are good, but the author seems to confuse 
UCS-2 with UTF-16, so the whole point about UTF-16 is plain wrong.

The author also doesn't seem to understand the Unicode 
definitions of character and grapheme, which is a shame, because 
the difference is more or less the whole point of the post.

> D+Phobos seem to fail most things (it produces BAFFLE):
> http://dpaste.dzfl.pl/a5268c435

D strings are arrays of code units and ranges of code points. The 
failure here is yours, in that you didn't use std.uni to handle 
graphemes.
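
To make the code unit / code point distinction concrete, here is a 
minimal sketch. The helper names `codeUnits` and `codePoints` are 
made up for illustration and are not part of Phobos; the test string 
is "noël" spelled with a combining diaeresis (U+0308).

```d
import std.range : walkLength;
import std.stdio : writeln;

// Hypothetical helpers, named only for this illustration.

// A `string` is an array of UTF-8 code units, so `.length` counts those.
size_t codeUnits(string s) { return s.length; }

// The range primitives decode narrow strings to code points, so
// `walkLength` counts code points.
size_t codePoints(string s) { return s.walkLength; }

void main()
{
    // 'n', 'o', 'e', U+0308 COMBINING DIAERESIS, 'l'
    string s = "noe\u0308l";

    writeln(codeUnits(s));  // 6: U+0308 takes two UTF-8 code units
    writeln(codePoints(s)); // 5: the combining mark is its own code point
}
```

Neither number is 4, which is what a reader would call the length; 
that count needs grapheme segmentation from std.uni.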

On that note, I tried to use std.uni to write a simple example of 
how to correctly handle this in D, but it became apparent that 
std.uni should expose something like `byGrapheme` which lazily 
transforms a range of code points to a range of graphemes 
(probably needs a `byCodePoint` to do the converse too). The two 
extant grapheme functions, `decodeGrapheme` and `graphemeStride`, 
are *awful* for string manipulation (granted, they are probably 
perfect for text rendering).
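
For reference, a rough sketch of what grapheme-aware reversal looks 
like when built directly on `graphemeStride`. The function 
`reverseGraphemes` is a hypothetical helper written for this post, 
not something std.uni provides, and it assumes valid UTF-8 input.

```d
import std.stdio : writeln;
import std.uni : graphemeStride;

// Hypothetical helper: reverse a string one grapheme at a time.
string reverseGraphemes(string s)
{
    string[] clusters;

    // Walk the string manually; graphemeStride returns the number of
    // code units occupied by the grapheme starting at index i.
    for (size_t i = 0; i < s.length; )
    {
        immutable n = graphemeStride(s, i);
        clusters ~= s[i .. i + n];
        i += n;
    }

    // Concatenate the slices back together in reverse order.
    string result;
    foreach_reverse (c; clusters)
        result ~= c;
    return result;
}

void main()
{
    // "noël" with a combining diaeresis; the mark stays on the 'e'.
    writeln(reverseGraphemes("noe\u0308l"));
}
```

The manual index bookkeeping is the awkward part; a lazy range of 
graphemes would let this compose with the usual range algorithms 
instead.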

