How to correctly deal with unicode strings?

Wed Nov 27 06:55:30 PST 2013

The normalize function in std.uni helps a lot here. (Well, it
would if it actually compiled, but that's an easy fix: just
import std.typecons in your copy of phobos/src/uni.d. It already
does so, but only in version(unittest)! LOL)

dstrings are a bit easier to use too:

import std.algorithm;
import std.stdio;
import std.uni;

void main(string[] args)
{
         dstring x = "noël"d.normalize;

         assert(x.length == 4); // Expected.

         assert(x[0 .. 3] == "noë"d.normalize); // Expected.

         import std.range;
         dstring y = x.retro.array;

         assert(y == "lëon"d.normalize); // Expected.
}

All of that works. The normalize function combines character
sequences, like 'e' followed by a combining diaeresis, into
single code points. Take a gander at this:

         foreach(dchar c; "noël"d)
                 writeln(cast(int) c);

110 // 'n'
111 // 'o'
101 // 'e'
776 // combining diaeresis (U+0308), modifies the preceding character
108 // 'l'

This btw is a great example of why .length should *not* be 
expected to give the number of characters, not even with 
dstrings, since a code point is not necessarily the same as a 
character! And, of course, with string, a code point is often not 
the same as a code unit.
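
To make the three different counts concrete, here is a minimal
sketch. It assumes a Phobos recent enough to have byGrapheme in
std.uni, and it spells the decomposed string with an explicit
\u0308 escape so the editor can't silently change it:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0308 (combining diaeresis), written as an escape.
    string  s = "noe\u0308l";
    dstring d = "noe\u0308l"d;

    assert(s.length == 6);                 // UTF-8 code units (U+0308 takes 2)
    assert(d.length == 5);                 // code points
    assert(s.byGrapheme.walkLength == 4);  // characters as a reader sees them
}
```

None of the .length values is 4; only walking the graphemes
gives the number a human would expect.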

What the normalize function does is go through and compose each
base character plus its combining characters into a single
precomposed code point, where one exists (the default form is
NFC):

         import std.uni;
         foreach(dchar c; "noël"d.normalize)
                 writeln(cast(int) c);

110 // 'n'
111 // 'o'
235 // 'ë'
108 // 'l'

BTW, since I'm copy/pasting here, I'm honestly not sure if the 
string is actually different in the D source or not, since they 
display the same way...
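
One way to take the copy/paste question out of the picture is to
write both spellings with escapes. This is just a sketch of that
idea; the variable names are mine:

```d
import std.uni;

void main()
{
    dstring decomposed  = "noe\u0308l"d; // 'e' + combining diaeresis
    dstring precomposed = "no\u00EBl"d;  // single code point for 'ë'

    // They render the same, but they are different arrays.
    assert(decomposed != precomposed);
    assert(decomposed.length == 5);
    assert(precomposed.length == 4);

    // normalize defaults to NFC, which composes...
    assert(decomposed.normalize == precomposed);
    // ...and NFD goes the other way.
    assert(precomposed.normalize!NFD == decomposed);
}
```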

But still, this is what's going on, and the normalize function is
the key to making these comparisons and reversals easier. A
normalized dstring is about as close as you can get to the
simplified ideal where one index into the array is one character.
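
If you don't want to rely on normalization for the reversal, a
possible alternative is to reverse by grapheme instead of by code
point. A rough sketch, assuming your Phobos has byGrapheme and
byCodePoint in std.uni (newer releases do; I haven't checked how
far back they go):

```d
import std.array : array;
import std.conv : text;
import std.range : retro;
import std.uni : byCodePoint, byGrapheme;

void main()
{
    string s = "noe\u0308l"; // decomposed, not normalized

    // Reversing code points moves the diaeresis onto the 'l'.
    assert(s.retro.text == "l\u0308eon");

    // Reversing graphemes keeps 'e' and its diaeresis together.
    string r = s.byGrapheme.array.retro.byCodePoint.text;
    assert(r == "le\u0308on");
}
```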