Today's programming challenge - How's your Range-Fu ?

Mon Apr 20 07:57:59 PDT 2015

On Monday, 20 April 2015 at 11:04:58 UTC, Panke wrote:
>>
>> Yes, again and again I encountered length related bugs with 
>> Unicode characters. Normalization is not 100% reliable.
>
> I think it is 100% reliable, it just doesn't make the problems 
> go away. It just guarantees that two strings normalized to the 
> same form are binary equal iff they are equal in the unicode 
> sense. Nothing about columns or string length or grapheme count.

The problem is not normalization as such; the problem is that 
string.length counts UTF-8 code units, whereas dstring.length 
counts code points:

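import std.uni : normalize, NFC;
void main() {

   // dstring: .length counts UTF-32 code units, i.e. code points
   dstring de_one = "é";        // precomposed U+00E9
   dstring de_two = "e\u0301";  // 'e' + combining acute accent

   assert(de_one.length == 1);
   assert(de_two.length == 2);

   // string: .length counts UTF-8 code units (bytes)
   string e_one = "é";
   string e_two = "e\u0301";

   string random = "ab";

   assert(e_one.length == 2);
   assert(e_two.length == 3);
   assert(e_one.length == random.length);  // one character vs. two

   // NFC composes e_two into the same two bytes as e_one
   assert(normalize!NFC(e_one).length == 2);
   assert(normalize!NFC(e_two).length == 2);
}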

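This can lead to subtle bugs: e_one and random have the same 
length (2), although one holds a single character and the other 
two. You have to convert everything to dstring (and, for 
decomposed input such as e_two, normalize it first) to get the 
"expected" result. However, this is not always desirable.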
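
For what it's worth, here is a minimal sketch of how the counts can 
be obtained without converting to dstring. It assumes a Phobos 
version that ships std.uni.byGrapheme and relies on walkLength 
decoding a string code point by code point. The helper names 
codePoints and graphemes are made up for illustration and are not 
part of Phobos.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: counts code points by walking the string as a
// range, which decodes the UTF-8 on the fly instead of counting bytes.
size_t codePoints(string s)
{
    return s.walkLength;
}

// Hypothetical helper: counts graphemes (user-perceived characters),
// so a base letter plus its combining marks counts once.
size_t graphemes(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string e_one = "\u00E9";   // precomposed é
    string e_two = "e\u0301";  // 'e' + combining acute accent

    // .length still reports UTF-8 code units
    assert(e_one.length == 2 && e_two.length == 3);

    // code points: 1 for the precomposed form, 2 for the decomposed one
    assert(codePoints(e_one) == 1 && codePoints(e_two) == 2);

    // graphemes: 1 in both cases, without normalizing first
    assert(graphemes(e_one) == 1 && graphemes(e_two) == 1);
}
```

Both helpers walk the whole string, so they are O(n), unlike 
.length. That is the price for not storing everything as dstring.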