std.string.toUpper() for greek characters
Dmitry Olshansky
dmitry.olsh at gmail.com
Wed Oct 3 10:57:45 PDT 2012
On 03-Oct-12 18:11, Minas wrote:
> On Wednesday, 3 October 2012 at 13:27:25 UTC, Paulo Pinto wrote:
>> On Wednesday, 3 October 2012 at 10:56:11 UTC, Minas wrote:
>>> Currently, toUpper() (and probably toLower()) does not handle Greek
>>> characters correctly. I fixed toUpper() by making another function
>>> for Greek characters
And a lot of others besides Greek. And this fix is handwritten and thus unmaintainable.
>>>
>>> // called if (c >= 0x387 && c <= 0x3CE)
>>> dchar toUpperGreek(dchar c)
>>> {
>>>     if( c >= 'α' && c <= 'ω' )
>>>     {
>>>         if( c == 'ς' )
>>>             c = 'Σ';
>>>         else
>>>             c -= 32;
>>>     }
>>>     else
>>>     {
>>>         dchar[dchar] map;
>>>         map['ά'] = 'Ά';
>>>         map['έ'] = 'Έ';
>>>         map['ή'] = 'Ή';
>>>         map['ί'] = 'Ί';
>>>         map['ϊ'] = 'Ϊ';
>>>         map['ΐ'] = 'Ϊ';
>>>         map['ό'] = 'Ό';
>>>         map['ύ'] = 'Ύ';
>>>         map['ϋ'] = 'Ϋ';
>>>         map['ΰ'] = 'Ϋ';
>>>         map['ώ'] = 'Ώ';
>>>
>>>         // characters without an entry (e.g. already uppercase) stay unchanged
>>>         if (auto p = c in map)
>>>             c = *p;
>>>     }
>>>
>>>     return c;
>>> }
>>>
>>> Then, in toUpper()
>>> {
>>>     ....
>>>     if (c >= 0x387 && c <= 0x3CE)
>>>         c = toUpperGreek(c)...
>>>     ///
>>> }
>>>
>>> Do you think it should stay like that, or should I copy-paste it into
>>> the body of toUpper()?
>>>
>>> I'm going to fix toLower() as well and make a pull request.
I'm *strongly* against bringing these temporary hacks into the standard
library. The fact that toUpper/toLower are outdated is bad, but fixing it
by piling hack after hack on this mess of if/else branches is not the
way out.
Also, I hope you haven't missed a few hundred characters over here:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Agreek%3A%5D+%26+%5B%3ACasedLetter%3A%5D&g=
The way out is a proper implementation that is a direct derivative of
the Unicode character database. And I've spent this summer on doing this
proper 'cure' for this kind of problem with Unicode support in D.
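To illustrate what "derived from the Unicode character database" can mean in practice, here is a minimal sketch, not the reworked library code itself. The names CasePair, upperTable and toUpperTable are hypothetical, and the six entries are a hand-picked excerpt; the assumption is that a real table would be generated by a script from UnicodeData.txt rather than typed in.

```d
// One (lowercase, uppercase) entry of a simple case mapping table.
struct CasePair { dchar lower; dchar upper; }

// Hypothetical excerpt, sorted by the lowercase code point. A real table
// would be generated from UnicodeData.txt and cover every cased letter.
immutable CasePair[] upperTable = [
    CasePair('\u03AC', '\u0386'), // ά -> Ά
    CasePair('\u03AD', '\u0388'), // έ -> Έ
    CasePair('\u03B1', '\u0391'), // α -> Α
    CasePair('\u03C2', '\u03A3'), // ς -> Σ
    CasePair('\u03C3', '\u03A3'), // σ -> Σ
    CasePair('\u03C9', '\u03A9'), // ω -> Ω
];

// Binary search over the table; characters without an entry are returned as-is.
dchar toUpperTable(dchar c)
{
    size_t lo = 0, hi = upperTable.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (upperTable[mid].lower < c)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < upperTable.length && upperTable[lo].lower == c)
        return upperTable[lo].upper;
    return c;
}

void main()
{
    assert(toUpperTable('\u03C2') == '\u03A3'); // final sigma
    assert(toUpperTable('\u03B1') == '\u0391'); // alpha
    assert(toUpperTable('a') == 'a');           // not in this excerpt
}
```

The point of the design is that the lookup code never changes when the data does: covering another script means regenerating the table, not adding another if/else branch.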
Admittedly, my reworked Unicode support probably won't hit the next
release (2.061). It needs to go through review etc. But I'm determined to
get it into 2.062.
I'd suggest keeping your personal version around for the moment and then
just switching to the new std one. However, given our release schedule, this
could be anywhere from 4 months to 1 year away :)
>>
>> Regarding toLower(), a problem I see is how to handle sigma (Σ),
>> because it has two possible lowercase representations depending on where
>> it occurs in a word. But of course toLower() works on a per-character
>> basis, so it cannot know what the receiver plans to do with the
>> character.
>>
>> --
>> Paulo
>
> Yeah, that's a problem indeed. I will make it become 'σ', and the
> programmer can change the final 'σ' to 'ς' himself.
I think this is one of a small number of special cases, see the full
list here:
ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt
(handling these subtleties is commonly called 'tailoring' and currently
I believe it is out of reach for the standard library)
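As a rough illustration of why final sigma needs string context, here is a sketch of a string-level lowercasing helper. The name toLowerFinalSigma is hypothetical, and the rule used (a letter before, no letter after) is a deliberate simplification of the Final_Sigma condition in SpecialCasing.txt, which also deals with case-ignorable characters. It assumes a std.uni whose per-character toLower and isAlpha already cover Greek.

```d
import std.uni : isAlpha, toLower;

// Hypothetical helper: lowercases a dstring, choosing 'ς' for a capital
// sigma that ends a word and 'σ' everywhere else.
dstring toLowerFinalSigma(dstring s)
{
    auto result = new dchar[s.length];
    foreach (i, c; s)
    {
        if (c == 'Σ')
        {
            // Simplified context test: only the direct neighbours are checked.
            immutable afterLetter  = i > 0 && isAlpha(s[i - 1]);
            immutable beforeLetter = i + 1 < s.length && isAlpha(s[i + 1]);
            result[i] = (afterLetter && !beforeLetter) ? 'ς' : 'σ';
        }
        else
            result[i] = toLower(c);
    }
    return result.idup;
}

void main()
{
    assert(toLowerFinalSigma("ΟΔΟΣ"d) == "οδος"d);   // word-final sigma
    assert(toLowerFinalSigma("ΣΟΦΟΣ"d) == "σοφος"d); // initial and final
    assert(toLowerFinalSigma("Σ"d) == "σ"d);         // isolated letter
}
```

A per-character toLower cannot make this choice, because the decision depends on the characters around the sigma rather than on the sigma itself.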
Currently my toLower will do 'σ' as prescribed by simple case folding
rules (i.e. the ones that can only map 1:1).
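For reference, this is how the 1:1 behaviour looks with the per-character functions of std.uni, assuming a Phobos release in which they already handle Greek (not the case for the toUpper/toLower discussed at the top of this thread):

```d
import std.uni : toLower, toUpper;

void main()
{
    dchar capital = 'Σ'; // U+03A3
    dchar medial  = 'σ'; // U+03C3
    dchar finalS  = 'ς'; // U+03C2

    // One code point in, one code point out.
    assert(toLower(capital) == medial);
    assert(toUpper(medial) == capital);
    assert(toUpper(finalS) == capital);

    // The round trip therefore loses the final form.
    assert(toLower(toUpper(finalS)) == medial);
}
```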
I have case-insensitive string comparison that does 1:n mappings as well
(and is going to replace current icmp) but it doesn't do tailoring.
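The difference between 1:1 and 1:n comparison can be seen with a German example. This sketch assumes a std.uni that provides both sicmp (simple folding only) and icmp (full folding, including 1:n mappings), as present-day Phobos does:

```d
import std.uni : icmp, sicmp;

void main()
{
    // Plain 1:1 cases compare equal under both functions.
    assert(sicmp("ΌΎ", "όύ") == 0);
    assert(icmp("ΌΎ", "όύ") == 0);

    // 'ß' folds to "ss", a 1:2 mapping, so only the full comparison matches.
    assert(sicmp("Rußland", "Russland") != 0);
    assert(icmp("Rußland", "Russland") == 0);
}
```

Neither call takes a locale, which matches the point above: 1:n folding is handled, language-specific tailoring is not.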
One day we may add some language specific tailoring (via locales etc.)
but we'd better do it carefully.
--
Dmitry Olshansky