How to correctly deal with unicode strings?

Dicebot public at dicebot.lv
Wed Nov 27 06:54:19 PST 2013


D strings have a dual nature. They behave as arrays of code units 
when you slice them or access .length directly (because those 
operations are guaranteed to be O(1)), but all algorithms in the 
standard library treat them as ranges of dchar, i.e. code points:

import std.algorithm;
import std.range : walkLength, take;
import std.array : array;

void main(string[] args)
{
	char[] x = "noël".dup;

	assert(x.length == 6);
	assert(x.walkLength == 5); // ë is stored as two code points here (e + U+0308)

	assert(x[0 .. 3] == "noe".dup); // Actual.
	assert(array(take(x, 4)) == "noë"d);

	x.reverse;

	assert(x == "l̈eon".dup); // Actual and correct!
}

The problem you have here is that ë can be represented as two 
separate Unicode code points (e followed by a combining diaeresis) 
despite being a single drawn symbol. It has nothing to do with 
strings being arrays of code units; using an array of `dchar` will 
result in the same behavior.
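
To illustrate that last point, here is a minimal sketch (not part of 
the original message). It assumes a current Phobos where `std.uni` 
provides `byGrapheme` and `byCodePoint`; older releases may lack them. 
`reverseByGrapheme` is a made-up helper name used only for this example. 
The `dchar[]` half shows the same misplaced diaeresis as above, and the 
grapheme half shows one possible way to keep the mark on its base letter:

```d
import std.algorithm : reverse;
import std.array : array;
import std.conv : text;
import std.range : retro, walkLength;
import std.uni : byCodePoint, byGrapheme;

// Hypothetical helper (not from the original post): reverse a string one
// grapheme cluster at a time, keeping combining marks on their base letter.
string reverseByGrapheme(string s)
{
    return s.byGrapheme   // group code points into grapheme clusters
            .array        // buffer them so they can be walked backwards
            .retro        // iterate the clusters last to first
            .byCodePoint  // flatten the clusters back into code points
            .text;        // encode the result as a UTF-8 string
}

void main()
{
    // 'e' followed by U+0308 COMBINING DIAERESIS, written as an escape so
    // the result does not depend on how this source file is normalized.
    dchar[] d = "noe\u0308l"d.dup;

    assert(d.length == 5);      // one element per code point, still not 4
    reverse(d);
    assert(d == "l\u0308eon"d); // the diaeresis has moved onto the 'l'

    string s = "noe\u0308l";
    assert(s.byGrapheme.walkLength == 4);         // four drawn symbols
    assert(reverseByGrapheme(s) == "le\u0308on"); // diaeresis stays on the 'e'
}
```

Buffering the clusters with `array` costs an allocation, but it is 
needed here because `byGrapheme` cannot be iterated backwards directly.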

