toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias

Sat Apr 21 19:25:33 PDT 2012

On Saturday, April 21, 2012 18:43:23 Ali Çehreli wrote:
> On 04/21/2012 04:24 PM, Jay Norwood wrote:
>  > While playing with sorting the unzip archive entries I tried use of the
>  > last example in http://dlang.org/phobos/std_algorithm.html#sort
>  > 
>  > std.algorithm.sort!("toLower(a.name) <
>  > toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);
> 
> Stealing this thread to point out that converting a letter to upper or
> lower case cannot be done without knowing the writing system. Phobos's
> toLower() documentation currently says: "Returns a string which is
> identical to s except that all of its characters are lowercase (in
> unicode, not just ASCII)."
> 
> Unicode cannot define the conversions of at least the following letters
> without knowing the actual alphabet that the text is written in:
> 
> - Lowercase of I is ı in some alphabets[*] and i in many others.
> 
> - Uppercase of i is İ in some alphabets[*] and I in many others.
> 
> Ali
> 
> [*] Turkish, Azeri, Crimean Tatar, Gagauz, Celtic, etc.

toLower and toUpper get pretty screwy with Unicode. I don't know enough
about non-English alphabets to know what affects what, but at minimum, there
are a number of cases where toLower does not reverse toUpper (and vice versa).
Rather, it converts the character into yet another letter. So, toLower and
toUpper with Unicode are definitely a bit iffy. I suppose that they do the job
if you call them enough on the string that it doesn't change anymore, but I
don't know.
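
To make the round-trip problem concrete, here is a minimal sketch. It assumes
the dchar overloads of toLower and toUpper in std.uni, and that they apply the
simple one-to-one case mappings from the Unicode character database. The
helper names roundTripLower and roundTripUpper are made up for illustration
and are not part of Phobos. The sample characters include the dotless ı and
dotted İ from Ali's list, plus the Kelvin sign.

```d
import std.stdio : writefln;
import std.uni : toLower, toUpper;

// Hypothetical helpers (not in Phobos): case-convert and convert back.
dchar roundTripLower(dchar c) { return toLower(toUpper(c)); }
dchar roundTripUpper(dchar c) { return toUpper(toLower(c)); }

void main()
{
    // 'a', 'I', U+0131 (dotless i), U+0130 (dotted capital I),
    // U+212A (Kelvin sign). Iterating as dchar decodes the UTF-8 string.
    foreach (dchar c; "aI\u0131\u0130\u212A")
    {
        // A character "survives" if the round trip returns it unchanged.
        writefln("U+%04X: lower(upper) = U+%04X, upper(lower) = U+%04X",
                 cast(uint) c,
                 cast(uint) roundTripLower(c),
                 cast(uint) roundTripUpper(c));
    }
}
```

If the tables behave as I expect, U+0131 comes back from roundTripLower as a
plain 'i', and U+0130 and U+212A come back from roundTripUpper as plain 'I'
and 'K', so none of the three survives the round trip. Note that no language
is passed in anywhere, so the Turkish-style I/ı pairing cannot be selected.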

I also don't know how they act with regard to the various alphabets or how
their implementation was decided upon. IIRC, Walter wrote them, and I'm sure
that they're based on the Unicode standard, but what that amounts to, I don't
know.

- Jonathan M Davis