How to correctly deal with unicode strings?
Dicebot
public at dicebot.lv
Wed Nov 27 06:54:19 PST 2013
D strings have a dual nature. They behave as arrays of code units
when you slice them or access .length directly (because those
operations are guaranteed to be O(1)), but all algorithms in the
standard library treat them as ranges of dchar (code points):
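import std.algorithm;
import std.range : walkLength, take;
import std.array : array;

void main(string[] args)
{
    // Spelled with an explicit escape so the example does not depend on
    // how the editor stores ë: here it is 'e' + combining diaeresis U+0308.
    char[] x = "noe\u0308l".dup;
    assert(x.length == 6);                   // UTF-8 code units
    assert(x.walkLength == 5);               // code points: ë is two of them here
    assert(x[0 .. 3] == "noe".dup);          // Actual: slicing works on code units.
    assert(array(take(x, 4)) == "noe\u0308"d); // take works on code points
    x.reverse;                               // in place, code point by code point
    assert(x == "l\u0308eon".dup);           // Actual and correct!
}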
The problem you have here is that ë can be represented as two
separate Unicode code points (e followed by the combining diaeresis
U+0308) despite being drawn as a single symbol. It has nothing to do
with strings being arrays of code units; using an array of `dchar`
will result in the same behavior.
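
If what you want is to work on drawn symbols rather than code points,
one option is the grapheme support in std.uni. The following is only a
minimal sketch, assuming a Phobos recent enough to have byGrapheme and
byCodePoint; graphemeCount and reverseGraphemes are hypothetical helper
names made up for this illustration, not library functions.

```d
import std.array : array;
import std.conv : text;
import std.range : retro, walkLength;
import std.uni : byCodePoint, byGrapheme, normalize;

// Hypothetical helper: counts grapheme clusters (drawn symbols),
// not code units or code points.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

// Hypothetical helper: reverses cluster by cluster, so a combining
// mark stays attached to its base character.
string reverseGraphemes(string s)
{
    return s.byGrapheme.array.retro.byCodePoint.text;
}

void main()
{
    string s = "noe\u0308l";  // 'e' followed by combining diaeresis
    dstring d = "noe\u0308l"d;

    assert(d.length == 5);    // dchar[] still sees two code points for ë
    assert(graphemeCount(s) == 4);
    assert(reverseGraphemes(s) == "le\u0308on");

    // Alternatively, normalize to the precomposed form first
    // (normalize defaults to NFC), after which ë is one code point.
    assert(normalize(s) == "no\u00EBl");
    assert(normalize(s).walkLength == 4);
}
```

Normalization only helps when a precomposed code point exists for the
sequence; iterating by grapheme does not depend on that, at the cost of
losing O(1) indexing.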