Turkish 'I's can't D either
Michel Fortin
michel.fortin at michelf.com
Tue Aug 25 04:40:42 PDT 2009
On 2009-08-25 00:23:25 -0400, Ali Cehreli <acehreli at yahoo.com> said:
> You may be aware of the problems related to the consistency of the two
> separate letter 'I's in the Turkish alphabet (and the alphabets that
> are based on the Turkish alphabet).
>
> Lowercase and uppercase versions of the two are consistent in whether
> they have a dot or not:
>
> http://en.wikipedia.org/wiki/Turkish_I
>
> Turkish alphabet being in a position so close to the western alphabets,
> but not close enough, puts it in a strange position. (Strangely; the
> same applies geographically, politically, socially, etc. as well... ;))
>
> Computer systems *almost* work for Turkish, but not for those two letters.
>
> I love the fact that D allows Unicode letters in the source code and
> that it natively supports Unicode. I cannot stress enough how important
> this is. That is the single biggest reason why I decided to finally
> write a programming tutorial. Thank you to all who proposed and
> implemented those features!
>
> Back to the Turquoise 'I's... What is a programmer to do who is writing
> programs that deal with Turkish letters?
>
> a) Accept that Phobos too has this age old behavior that is a result of
> premature optimization (i.e. this code in tolower: c + (cast(char)'a' -
> 'A'))
>
> b) Accept that the problem is unsolvable because the letter I has two
> minuscules, and the letter i has two majuscules anyway, and that the
> intent is not always clear
>
> c) Accept Turkish alphabet as being pathological (merely for being in
> the minority!), and use a Turkish version of Phobos or some other
> library
>
> d) Solve the problem with locale support
>
> Is option d possible with today's systems? Whose responsibility is this
> anyway? OS? Language? Program? Something else?
>
> Since alphanumerical ordering is also of interest, I think this has
> something to do with locales.
>
> Is there a way for a program to work with Turkish letters and ensure
> that the following program produces the expected output of 'dotless i',
> 'I with dot', and 0?
>
> import std.stdio;
> import std.string;
> import std.c.locale;
> import std.uni;
>
> void main()
> {
>     const char * result = setlocale(LC_ALL, "tr_TR.UTF-8");
>     assert(result);
>
>     writeln(toUniLower('I'));
>     writeln(toUniUpper('i'));
>     writeln(indexOf("I",
>                     '\u0131', // dotless i
>                     (CaseSensitive).no));
> }
>
> This is a practical question. I really want to be able to work with
> Turkish... :)
Perhaps this could be of some inspiration. In Cocoa you can pass a
locale argument to many string methods (unfortunately, not
lowercaseString or uppercaseString) to get the desired result. For
instance, the "rangeOfString:options:range:locale:" method can search
for substrings case-insensitively, and its documentation specifically
discusses the Turkish “ı” character under the locale parameter.
http://developer.apple.com/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/Reference/NSString.html#//apple_ref/occ/instm/NSString/rangeOfString:options:range:locale:
It's also interesting to see that when you search for ß in a webpage
using Safari, it also matches every instance of SS (whatever your
locale). ß is a German character that becomes SS in uppercase.
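For comparison, current Phobos appears to offer a similar locale-independent facility. The following is a minimal sketch, assuming a recent std.uni: icmp is documented as using full case folding (so ß matches "ss"), while sicmp uses only simple one-to-one folding. Neither function takes a locale parameter, so neither helps with the Turkish letters.

```d
import std.uni : icmp, sicmp;

void main()
{
    // Full case folding: ß is expanded, so the two spellings compare equal.
    assert(icmp("Rußland", "Russland") == 0);

    // Simple (one-to-one) folding cannot expand ß, so the strings differ.
    assert(sicmp("Rußland", "Russland") != 0);
}
```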
- - -
What I'd like to see is a base class representing a locale. Then you
can instantiate the locale you want (from a config file, by coding it
directly, having bindings to system APIs, or a mix of all this) and use
it. Something like:

class Locale
{
immutable:
    string lowercase(string s);
    string uppercase(string s);
    int compare(string a, string b);
    // number & date formatting, etc.
}

immutable(Locale) systemLocale(); // get default system locale
immutable(Locale) locale(string localeName); // get best matching locale

void main()
{
    // auto, because the factory functions return immutable(Locale)
    auto turkish = locale("tr-TR");
    writeln(turkish.lowercase("I")); // writes "ı"
    writeln(turkish.uppercase("i")); // writes "İ"

    auto english = locale("en-US");
    writeln(english.lowercase("I")); // writes "i"
    writeln(english.uppercase("i")); // writes "I"

    writeln(systemLocale.lowercase("I")); // depends on user settings
    writeln(systemLocale.uppercase("i")); // depends on user settings
}

This way you can work with many locales at once. And there's no
reliance on a global state.
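
To make the idea concrete, here is a minimal runnable sketch of the case-mapping part only. Everything in it is an assumption for illustration: the names DefaultLocale and TurkishLocale are hypothetical, the locale() lookup is a crude prefix test rather than real best-match logic, immutability and compare() are left out for brevity, and the fallback simply delegates to the locale-independent std.uni.toLower and std.uni.toUpper from a recent Phobos.

```d
import std.algorithm.searching : startsWith;
import std.stdio : writeln;
import std.uni : toLower, toUpper;

// Hypothetical interface, reduced to the two case-mapping methods.
interface Locale
{
    string lowercase(string s);
    string uppercase(string s);
}

// Fallback: Phobos' locale-independent Unicode mappings.
class DefaultLocale : Locale
{
    string lowercase(string s) { return toLower(s); }
    string uppercase(string s) { return toUpper(s); }
}

// Turkish: special-case the dotted and dotless I, delegate the rest.
class TurkishLocale : Locale
{
    string lowercase(string s)
    {
        string result;
        foreach (dchar c; s)
        {
            if (c == 'I')
                result ~= '\u0131';   // I -> ı (dotless i)
            else if (c == '\u0130')
                result ~= 'i';        // İ -> i
            else
                result ~= toLower(c); // per-code-point default mapping
        }
        return result;
    }

    string uppercase(string s)
    {
        string result;
        foreach (dchar c; s)
        {
            if (c == 'i')
                result ~= '\u0130';   // i -> İ (I with dot)
            else if (c == '\u0131')
                result ~= 'I';        // ı -> I
            else
                result ~= toUpper(c); // per-code-point default mapping
        }
        return result;
    }
}

// Hypothetical lookup: a crude prefix test, not real best-match logic.
Locale locale(string name)
{
    if (name == "tr" || name.startsWith("tr-", "tr_") != 0)
        return new TurkishLocale;
    return new DefaultLocale;
}

void main()
{
    Locale turkish = locale("tr-TR");
    writeln(turkish.lowercase("I")); // writes "ı"
    writeln(turkish.uppercase("i")); // writes "İ"

    Locale english = locale("en-US");
    writeln(english.lowercase("I")); // writes "i"
    writeln(english.uppercase("i")); // writes "I"
}
```

Each Locale object carries its own rules, so the Turkish and English mappings can be used side by side without touching setlocale or any other global state.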
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/