How to correctly deal with unicode strings?

Wed Nov 27 06:55:29 PST 2013

Bearophile also linked this article:
http://forum.dlang.org/thread/nieoqqmidngwoqwnktih@forum.dlang.org

On Wednesday, 27 November 2013 at 14:34:15 UTC, Gary Willoughby 
wrote:
> I've just been reading this article: 
> http://mortoray.com/2013/11/27/the-string-type-is-broken/ and 
> wanted to test if D performed in the same way as he describes, 
> i.e. unicode strings being 'broken' because they are just 
> arrays.

No. While Unicode strings are "just" arrays, that's not why it's 
"broken". Unicode strings are *stored* in arrays, as a sequence of 
"codeunits", but they are still decoded one entire "codepoint" at 
a time, so that's not the issue.

The main issue is that in Unicode, a "character" (if that means 
anything), or a "grapheme", can be composed of two (or more) 
codepoints that mustn't be separated. Currently, D does not know 
how to deal with this.
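
To make the codeunit/codepoint distinction concrete, here is a minimal sketch. It assumes the "noël" from the quoted example is spelled with a combining diaeresis (e followed by U+0308), which is what the length of 6 reported below suggests; the escape sequence is only there to make that spelling explicit.

```d
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    // Assumption: "noël" written as e + U+0308 (combining diaeresis)
    string s = "noe\u0308l";

    writeln(s.length);     // 6: UTF-8 codeunits (U+0308 takes two)
    writeln(s.walkLength); // 5: codepoints, decoded one at a time

    // Iterating with dchar yields whole codepoints, never half of one
    foreach (dchar c; s)
        writeln(cast(uint) c);
}
```

Neither number is the 4 a reader would count on screen, which is the grapheme problem described above.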

> Although i understand the difference between code units and 
> code points it's not entirely clear in D what i need to do to 
> avoid the situations he describes. For example:
>
> import std.algorithm;
> import std.stdio;
>
> void main(string[] args)
> {
> 	char[] x = "noël".dup;
>
> 	assert(x.length == 6); // Actual
> 	// assert(x.length == 4); // Expected.

This is a source of confusion: a string is *not* a random access 
range. This means that "length" is not actually part of the 
"string interface": It is only an "underlying implementation 
detail".

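Try this (hasLength and walkLength come from std.range):
alias String = typeof(x);
static if (hasLength!String)
     assert(x.length == 5);
else
     assert(x.walkLength == 5); // codepoints: n, o, e, U+0308, l

This counts codepoints regardless of the string's "width" 
(char/wchar/dchar). Note that it gives 5 rather than the expected 4 
here, because this "ë" is stored as "e" plus a combining diaeresis.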
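
The claim that narrow strings are not random access can be checked at compile time with the range traits in std.range. A small sketch, assuming a Phobos where strings are auto-decoded as ranges of dchar:

```d
import std.range : hasLength, hasSlicing, isRandomAccessRange;

// Narrow strings: indexing and length work on the array, but the
// range interface does not advertise them.
static assert(!isRandomAccessRange!string);
static assert(!hasLength!string);
static assert(!hasSlicing!string);

// A dstring stores one codepoint per element, so it qualifies.
static assert(isRandomAccessRange!dstring);
static assert(hasLength!dstring);

void main() {}
```

This is why generic code should branch on hasLength instead of assuming that .length means "number of characters".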

> 	assert(x[0 .. 3] == "noe".dup); // Actual.
> 	// assert(x[0 .. 3] == "noë".dup); // Expected.

Again, don't slice your strings like that: a string is neither 
random access nor sliceable. You have no guarantee your third 
character will start at index 3. You want:

assert(equal(x.take(3), "noe"));

Note that "x.take(3)" will not actually give you a slice, but a 
lazy range. If you want a slice, you need to walk the string, and 
extract the index:

auto index = x.length - x.drop(3).length; // drop is in std.range
assert(x[0 .. index] == "noe");

Note that this is *only* "UTF-correct"; it is still wrong from a 
Unicode point of view. Again, that's because this ë is actually a 
single grapheme composed of *two* codepoints.
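
The walk-then-slice idea can be wrapped up as follows. This is only a sketch: firstCodePoints is a hypothetical helper name, not something in Phobos, and the input again assumes the e + U+0308 spelling.

```d
import std.range : drop;
import std.traits : isSomeString;

// Hypothetical helper: the leading slice of `s` that holds its
// first `n` codepoints (or all of `s` if it has fewer).
S firstCodePoints(S)(S s, size_t n)
    if (isSomeString!S)
{
    auto rest = s.drop(n);                 // walks n codepoints
    return s[0 .. s.length - rest.length]; // slice on a codepoint boundary
}

void main()
{
    string s = "noe\u0308l";
    // UTF-correct, but the accent is cut off from its base letter
    assert(firstCodePoints(s, 3) == "noe");
    // One more codepoint keeps the diaeresis attached
    assert(firstCodePoints(s, 4) == "noe\u0308");
}
```

The slice never lands inside a codepoint, but as the first assert shows, it can still land inside a grapheme.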

> 	x.reverse;
>
> 	assert(x == "l̈eon".dup); // Actual
> 	// assert(x == "lëon".dup); // Expected.
> }
>
> Here i understand what is happening but how could i improve 
> this example to make the expected asserts true?

AFAIK, we don't have any way of dealing with this (yet).
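
For readers on a Phobos release whose std.uni provides byGrapheme and byCodePoint (an assumption; the reply above does not rely on them), a grapheme-aware reverse can be sketched like this. reverseGraphemes is a hypothetical helper name.

```d
import std.array : array;
import std.conv : text;
import std.range : retro, walkLength;
import std.uni : byCodePoint, byGrapheme;

// Hypothetical helper: reverse a string one grapheme at a time.
string reverseGraphemes(string s)
{
    return s.byGrapheme   // group codepoints into graphemes
            .array        // retro needs a bidirectional range
            .retro        // reverse the order of the graphemes
            .byCodePoint  // flatten back into codepoints
            .text;        // re-encode as a UTF-8 string
}

void main()
{
    string s = "noe\u0308l"; // assumes e + combining diaeresis
    assert(s.byGrapheme.walkLength == 4);        // the "expected" length
    assert(reverseGraphemes(s) == "le\u0308on"); // accent stays on the e
}
```

Unlike x.reverse in the quoted example, the diaeresis stays with its e because each grapheme is moved as a unit.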