Today's programming challenge - How's your Range-Fu ?

Sun Apr 19 00:26:35 PDT 2015

On Saturday, 18 April 2015 at 16:01:20 UTC, Andrei Alexandrescu wrote:
> On 4/18/15 4:35 AM, Jacob Carlborg wrote:
>> On 2015-04-18 12:27, Walter Bright wrote:
>>
>>> That doesn't make sense to me, because the umlauts and the accented e
>>> all have Unicode code point assignments.
>>
>> This code snippet demonstrates the problem:
>>
>> import std.stdio;
>>
>> void main ()
>> {
>>     dstring a = "e\u0301";
>>     dstring b = "é";
>>     assert(a != b);
>>     assert(a.length == 2);
>>     assert(b.length == 1);
>>     writeln(a, " ", b);
>> }
>>
>> If you run the above code all asserts should pass. If your system
>> correctly supports Unicode (works on OS X 10.10) the two printed
>> characters should look exactly the same.
>>
>> \u0301 is the "combining acute accent" [1].
>>
>> [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
>
> Isn't this solved commonly with a normalization pass? We should 
> have a normalizeUTF() that can be inserted in a pipeline. Then 
> the rest of Phobos doesn't need to mind these combining 
> characters. -- Andrei

Normalisation can sometimes allow simplifications, but knowing 
whether it will requires a lot of a priori knowledge about the 
input as well as the normalisation form.
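
To make that concrete, here is a minimal sketch using 
std.uni.normalize from Phobos. The variable names are mine, and 
the "q" case is just one illustrative sequence that, as far as I 
know, has no precomposed form. NFC folds the "e" + combining acute 
pair into a single code point, but it cannot do the same for "q", 
so code that assumes one code point per character after 
normalisation is still wrong for that input:

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    // Escapes are used so the test does not depend on how the
    // editor stored the source file.
    dstring decomposed  = "e\u0301"d; // 'e' + combining acute accent
    dstring precomposed = "\u00E9"d;  // precomposed e-acute

    // Different code point sequences, equal after NFC.
    assert(decomposed != precomposed);
    assert(normalize!NFC(decomposed) == precomposed);

    // Assumption for illustration: 'q' + combining acute has no
    // precomposed code point, so NFC leaves two code points.
    dstring noPrecomposed = "q\u0301"d;
    assert(normalize!NFC(noPrecomposed).length == 2);

    // It is still a single grapheme, which only byGrapheme sees.
    assert(noPrecomposed.byGrapheme.walkLength == 1);
}
```

So a normalisation pass helps with comparisons, but whether the 
rest of the pipeline can then ignore combining characters depends 
on what is actually in the input.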