Unicode handling comparison

Wed Nov 27 11:19:53 PST 2013

On 11/27/2013 06:45 AM, David Nadlinger wrote:
> On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
>> Through Reddit I have seen this small comparison of Unicode handling 
>> between different programming languages:
>>
>> http://mortoray.com/2013/11/27/the-string-type-is-broken/
>>
>> D+Phobos seem to fail most things (it produces BAFFLE):
>> http://dpaste.dzfl.pl/a5268c435
>
> If you need to perform this kind of operation on Unicode strings in 
> D, you can call normalize (std.uni) on the string first to make sure 
> it is in one of the Normalization Forms. For example, just appending 
> .normalize to your strings (which defaults to NFC) would make the code 
> produce the "expected" results.
>
> As far as I'm aware, this behavior is the result of a deliberate 
> decision, as normalizing strings on the fly isn't really cheap.
>
> David
>
I don't like the overhead, and I don't know how important this is, but 
perhaps the best way to solve it would be to have string include a 
"normalization" byte recording whether it has been normalized and, if 
so, in which form.  That there are multiple normalization forms is 
painful, but that *is* the standard.  The byte would allow normalization 
to be skipped whenever the two strings being compared carry the same 
normalization (or lack thereof).  What to do if they're normalized 
differently is a bit of a puzzle, but most reasonable solutions would 
work for most cases, so you just need a way to override the defaults.
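
To make the idea concrete, here is a rough sketch in D. It is only an 
illustration under stated assumptions, not a proposal for how Phobos 
should do it: Form, TaggedString, tagNFC, tagNFD and sameText are 
hypothetical names invented for this example, and falling back to NFC 
when the tags differ is just one possible default. The only library 
pieces used are normalize and NormalizationForm from std.uni.

```d
import std.uni : normalize, NormalizationForm;

// Hypothetical one-byte tag; `none` means the text was never normalized.
enum Form : ubyte { none, nfc, nfd }

// Hypothetical wrapper: the text plus its normalization byte.
struct TaggedString
{
    string text;
    Form form = Form.none;
}

// Normalize once, up front, and record the form in the tag.
TaggedString tagNFC(string s)
{
    return TaggedString(normalize!(NormalizationForm.NFC)(s), Form.nfc);
}

TaggedString tagNFD(string s)
{
    return TaggedString(normalize!(NormalizationForm.NFD)(s), Form.nfd);
}

// Compare two tagged strings. Matching tags (including "both untagged")
// skip normalization entirely; mismatched tags fall back to `fallback`,
// an assumed default that the caller can override.
bool sameText(NormalizationForm fallback = NormalizationForm.NFC)
             (TaggedString a, TaggedString b)
{
    if (a.form == b.form)
        return a.text == b.text;
    return normalize!fallback(a.text) == normalize!fallback(b.text);
}

unittest
{
    auto composed   = "\u00E9";   // e-acute as a single code point
    auto decomposed = "e\u0301";  // 'e' followed by a combining acute accent

    // Same tag: a plain comparison, no normalization work at compare time.
    assert(sameText(tagNFC(composed), tagNFC(decomposed)));
    // Different tags: both sides are brought to the fallback form first.
    assert(sameText(tagNFC(composed), tagNFD(decomposed)));
    assert(sameText!(NormalizationForm.NFD)(tagNFC(composed), tagNFD(decomposed)));
    // Both untagged: compared as-is, so the two spellings still differ.
    assert(!sameText(TaggedString(composed), TaggedString(decomposed)));
}
```

Saved as a file, this should run with something like 
"dmd -unittest -main -run sketch.d".  Note the last assertion: two 
untagged strings are compared without any normalization, which is what 
"or lack thereof" implies, so the tag only pays off once strings are 
normalized somewhere along the way.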

-- 
Charles Hixson