Today's programming challenge - How's your Range-Fu ?

Sat Apr 18 05:25:45 PDT 2015

On Saturday, 18 April 2015 at 11:52:52 UTC, Chris wrote:
> On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg 
> wrote:
>> On 2015-04-18 12:27, Walter Bright wrote:
>>
>>> That doesn't make sense to me, because the umlauts and the 
>>> accented e
>>> all have Unicode code point assignments.
>>
>> This code snippet demonstrates the problem:
>>
>> import std.stdio;
>>
>> void main ()
>> {
>>    dstring a = "e\u0301";
>>    dstring b = "é";
>>    assert(a != b);
>>    assert(a.length == 2);
>>    assert(b.length == 1);
>>    writeln(a, " ", b);
>> }
>>
>> If you run the above code all asserts should pass. If your 
>> system correctly supports Unicode (works on OS X 10.10) the 
>> two printed characters should look exactly the same.
>>
>> \u0301 is the "combining acute accent" [1].
>>
>> [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
>
> Yep, this was the cause of some bugs I had in my program. The 
> thing is, you never know whether a text is composed or 
> decomposed, so you have to be prepared for "é" to have length 
> 1 or 2. On OS X these characters are decomposed by default, so 
> if you pipe an "é" (length=1) through the system, it 
> automatically becomes "e\u0301" (length=2). The same goes for 
> file names on OS X. I've had to find a workaround for this 
> more than once.

byGrapheme to the rescue:

http://dlang.org/phobos/std_uni.html#byGrapheme

Or is this unsuitable here?
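
A minimal sketch of what I have in mind, assuming the goal is to treat "e\u0301" and the precomposed "é" as one user-perceived character each. graphemeCount is a hypothetical helper name, not something in Phobos; byGrapheme, normalize and NFC are from std.uni, and walkLength is from std.range. Whether counting is enough or the text should be normalized first depends on the use case:

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme, normalize, NFC;

/// Hypothetical helper: number of grapheme clusters
/// (user-perceived characters) in `s`.
size_t graphemeCount(S)(S s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    dstring a = "e\u0301"; // 'e' followed by the combining acute accent
    dstring b = "\u00E9";  // precomposed é

    // The code point counts differ ...
    assert(a.length == 2);
    assert(b.length == 1);

    // ... but each string is a single grapheme.
    assert(graphemeCount(a) == 1);
    assert(graphemeCount(b) == 1);

    // The strings compare unequal as-is, and equal once both are in NFC.
    assert(a != b);
    assert(normalize!NFC(a) == normalize!NFC(b));

    writeln(a, " ", b);
}
```

byGrapheme only changes how the text is iterated; it does not make a == b true. For comparisons (or for file names coming back from OS X), normalizing both sides to the same form first is probably the safer route.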