How to correctly deal with unicode strings?
Adam D. Ruppe
destructionator at gmail.com
Wed Nov 27 06:55:30 PST 2013
The normalize function in std.uni helps a lot here. (Well, it
would if it actually compiled, but that's an easy fix: just
import std.typecons in your copy of phobos/src/uni.d. It already
does so, but only in version(unittest)! LOL)
dstrings are a bit easier to use too:
import std.algorithm;
import std.stdio;
import std.uni;
void main(string[] args)
{
    dstring x = "noël"d.normalize;
    assert(x.length == 4); // Expected.
    assert(x[0 .. 3] == "noë"d.normalize); // Expected.
    import std.range;
    dstring y = x.retro.array;
    assert(y == "lëon"d.normalize); // Expected.
}
All of that works. The normalize function turns character
pairs, such as 'e' followed by a combining dieresis, into single
code points. Take a gander at this:
foreach(dchar c; "noël"d)
writeln(cast(int) c);
110 // 'n'
111 // 'o'
101 // 'e'
776 // combining dieresis (U+0308), applies to the preceding character
108 // 'l'
This btw is a great example of why .length should *not* be
expected to give the number of characters, not even with
dstrings, since a code point is not necessarily the same as a
character! And, of course, with string, a code point is often not
the same as a code unit.
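To make the three different counts concrete, here is a small sketch. It spells the decomposed string with a \u escape so the combining character survives copy/paste, and it assumes a std.uni that provides byGrapheme. The graphemeCount helper is a name made up for this example, not something in Phobos.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper (not part of Phobos): counts user-perceived
// characters, i.e. grapheme clusters, rather than code points.
size_t graphemeCount(dstring s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0308 (combining dieresis), written as an escape.
    string  narrow = "noe\u0308l";  // UTF-8
    dstring wide   = "noe\u0308l"d; // UTF-32

    assert(narrow.length == 6);       // code units: U+0308 takes two bytes
    assert(wide.length == 5);         // code points: 'e' and U+0308 are separate
    assert(graphemeCount(wide) == 4); // characters as a reader sees them
}
```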
What the normalize function does (it defaults to NFC, the
composed form) is go through and merge each base character with
its combining characters into a single code point:
import std.uni;
foreach(dchar c; "noël"d.normalize)
writeln(cast(int) c);
110 // 'n'
111 // 'o'
235 // 'ë'
108 // 'l'
BTW, since I'm copy/pasting here, I'm honestly not sure if the
string is actually different in the D source or not, since they
display the same way...
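One way to take the copy/paste doubt out of the picture is to write both forms with explicit escapes and compare them. This is only a sketch: sameText is a hypothetical helper name, while normalize, NFC and NFD are the std.uni names.

```d
import std.uni : normalize, NFC, NFD;

// Hypothetical helper (not part of Phobos): compares two strings
// after bringing both to the same normalization form.
bool sameText(dstring a, dstring b)
{
    return a.normalize!NFC == b.normalize!NFC;
}

void main()
{
    dstring decomposed = "noe\u0308l"d; // 'e' + combining dieresis
    dstring composed   = "no\u00EBl"d;  // precomposed 'ë' (code point 235)

    assert(decomposed != composed);                // different code points
    assert(decomposed.normalize!NFC == composed);  // composing merges the pair
    assert(composed.normalize!NFD == decomposed);  // decomposing splits it again
    assert(sameText(decomposed, composed));
}
```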
But still, this is what's going on, and the normalize function is
the key to making these comparisons and reversals easier to do. A
normalized dstring is about as close as you can get to the
simplified ideal of one index into the array being one character.
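"About as close as you can get" is still not exact, since not every combining sequence has a precomposed form. For reversals specifically, a grapheme-based version sidesteps the question. The sketch below follows the byGrapheme/byCodePoint idiom from the std.uni documentation and assumes a Phobos release recent enough to have byCodePoint. reverseByGrapheme is a name made up for this example.

```d
import std.array : array;
import std.conv : text;
import std.range : retro;
import std.uni : byCodePoint, byGrapheme;

// Hypothetical helper (not part of Phobos): reverses a string one
// grapheme cluster at a time, so combining marks stay attached to
// their base character even without normalizing first.
string reverseByGrapheme(string s)
{
    return s.byGrapheme
        .array
        .retro
        .byCodePoint
        .text;
}

void main()
{
    // Decomposed input: 'e' followed by U+0308.
    assert(reverseByGrapheme("noe\u0308l") == "le\u0308on");
}
```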