std.string.toUpper() for greek characters

Wed Oct 3 10:57:45 PDT 2012

On 03-Oct-12 18:11, Minas wrote:
> On Wednesday, 3 October 2012 at 13:27:25 UTC, Paulo Pinto wrote:
>> On Wednesday, 3 October 2012 at 10:56:11 UTC, Minas wrote:
>>> Currently, toUpper() (and probably toLower()) does not handle greek
>>> characters correctly. I fixed toUpper() by making another function
>>> for greek characters

And a lot of others. And it is handwritten and thus unmaintainable.

>>>
>>> // called if (c >= 0x387 && c <= 0x3CE)
>>> dchar toUpperGreek(dchar c)
>>> {
>>>     if( c >= 'α' && c <= 'ω' )
>>>     {
>>>         if( c == 'ς' )
>>>             c = 'Σ';
>>>         else
>>>             c -= 32;
>>>     }
>>>     else
>>>     {
>>>         dchar[dchar] map;
>>>         map['ά'] = 'Ά';
>>>         map['έ'] = 'Έ';
>>>         map['ή'] = 'Ή';
>>>         map['ί'] = 'Ί';
>>>         map['ϊ'] = 'Ϊ';
>>>         map['ΐ'] = 'Ϊ';
>>>         map['ό'] = 'Ό';
>>>         map['ύ'] = 'Ύ';
>>>         map['ϋ'] = 'Ϋ';
>>>         map['ΰ'] = 'Ϋ';
>>>         map['ώ'] = 'Ώ';
>>>
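>>>         // Leave code points without an entry (e.g. letters that are
>>>         // already upper case) unchanged instead of failing the lookup.
>>>         if (auto p = c in map)
>>>             c = *p;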
>>>     }
>>>
>>>     return c;
>>> }
>>>
>>> Then, in toUpper()
>>> {
>>>   ....
>>>   if (c >= 0x387 && c <= 0x3CE)
>>>      c = toUpperGreek()...
>>>   ///
>>> }
>>>
>>> Do you think it should stay like that or I should copy-paste it in
>>> the body of toUpper()?
>>>
>>> I'm going to fix toLower() as well and make a pull request.

I'm *strongly* against bringing these temporary hacks into the standard
library. The fact that toUpper/toLower are outdated is bad, but fixing it
by piling hack after hack on this mess of if/else branches is not the
way out.
Also, I hope you haven't lost a few hundred characters over here:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Agreek%3A%5D+%26+%5B%3ACasedLetter%3A%5D&g=

The way out is a proper implementation that is a direct derivative of
the Unicode character database. And I've spent this summer working on
this proper 'cure' for these kinds of problems with Unicode support in D.
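
To illustrate what "derived from the Unicode character database" could look like, here is a minimal sketch of a table-driven lookup. It is not the reworked std.uni code: the names CasePair, upperTable and toUpperTable are hypothetical, and the table is a tiny hand-copied excerpt, whereas a real one would be generated from UnicodeData.txt and cover every cased letter.

```d
// One simple (1:1) case pair: a lower-case code point and its upper-case form.
struct CasePair
{
    dchar lower;
    dchar upper;
}

// Hypothetical excerpt, sorted by `lower`. A real table would be generated
// from UnicodeData.txt rather than typed in by hand.
immutable CasePair[] upperTable = [
    CasePair('\u03AC', '\u0386'), // ά -> Ά
    CasePair('\u03B1', '\u0391'), // α -> Α
    CasePair('\u03C2', '\u03A3'), // ς -> Σ
    CasePair('\u03C3', '\u03A3'), // σ -> Σ
    CasePair('\u03C9', '\u03A9'), // ω -> Ω
];

// Binary search over the sorted table; code points without an entry
// are returned unchanged.
dchar toUpperTable(dchar c)
{
    size_t lo = 0, hi = upperTable.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (upperTable[mid].lower < c)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < upperTable.length && upperTable[lo].lower == c)
        return upperTable[lo].upper;
    return c;
}
```

The point of the design is that the lookup code never changes: supporting another script or a newer Unicode version only means regenerating the table.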

Admittedly, my reworked Unicode support probably won't make the next
release (2.061). It needs to go through review etc. But I'm determined to
get it into 2.062.

I'd suggest keeping your personal version around for the moment and then
just switching to the new std one. However, given our release schedule,
this could be anywhere from 4 months to 1 year away :)

>>
>> Regarding toLower() a problem I see is how to handle sigma (Σ),
>> because it has two possible lower case representations depending where
>> it occurs in a word. But of course toLower() is working on character
>> basis, so it cannot know what the receiver plans to do with the
>> character.
>>
>> --
>> Paulo
>
> Yeah, that's a problem indeed. I will make it become 'σ', and the
> programmer can change the final 'σ' to 'ς' himself.

I think this is one of a small number of special cases, see the full 
list here:
ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt
(handling these subtleties is commonly called 'tailoring', and I believe
it is currently out of reach for the standard library)

Currently my toLower will produce 'σ', as prescribed by the simple case
mapping rules (i.e. the ones that can only map 1:1).
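
For anyone who wants the final sigma anyway, a post-processing pass on top of the 1:1 mapping is one option. The sketch below is only an approximation of the Final_Sigma condition in SpecialCasing.txt: it treats a 'σ' as final when it follows a letter and is not followed by one, and it ignores the case-ignorable characters the real rule accounts for. The function name toLowerFinalSigma is hypothetical, and the sketch assumes a std.uni whose toLower already handles Greek.

```d
import std.conv : to;
import std.uni : isAlpha, toLower;

// Lower-case every code point 1:1, then turn a word-final σ into ς.
string toLowerFinalSigma(string s)
{
    dchar[] buf;
    foreach (dchar c; s)          // decode UTF-8 into code points
        buf ~= toLower(c);

    foreach (i, ref c; buf)
    {
        if (c != '\u03C3')        // only σ is a candidate
            continue;
        immutable precededByLetter = i > 0 && isAlpha(buf[i - 1]);
        immutable followedByLetter = i + 1 < buf.length && isAlpha(buf[i + 1]);
        if (precededByLetter && !followedByLetter)
            c = '\u03C2';         // ς
    }
    return buf.to!string;
}
```

Because the decision needs the neighbouring characters, it has to work on strings; a per-character toLower cannot make it.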

I have a case-insensitive string comparison that does 1:n mappings as well
(and is going to replace the current icmp), but it doesn't do tailoring.
One day we may add some language-specific tailoring (via locales etc.),
but we'd better do it carefully.
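
To show what a 1:n mapping means for comparison, here is a rough sketch, not the code that is meant to replace icmp. The names foldAll and equalIgnoreCase are hypothetical, only two folding entries from CaseFolding.txt are included, and toLower stands in for the remaining 1:1 foldings, which is an approximation.

```d
import std.uni : toLower;

// Fold a string for caseless comparison. One input code point may expand
// to several output code points, which a 1:1 mapping cannot express.
dstring foldAll(string s)
{
    dstring result;
    foreach (dchar c; s)
    {
        switch (c)
        {
            case '\u00DF':            // ß folds to two code points: "ss"
                result ~= "ss"d;
                break;
            case '\u03C2':            // final sigma ς folds to σ
                result ~= '\u03C3';
                break;
            default:                  // assumption: toLower approximates the rest
                result ~= toLower(c);
                break;
        }
    }
    return result;
}

// Two strings match when their folded forms are identical.
bool equalIgnoreCase(string a, string b)
{
    return foldAll(a) == foldAll(b);
}
```

Note that none of this is locale-dependent, so it still says nothing about tailoring.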

-- 
Dmitry Olshansky