Turkish 'I's can't D either

Tue Aug 25 04:40:42 PDT 2009

On 2009-08-25 00:23:25 -0400, Ali Cehreli <acehreli at yahoo.com> said:

> You may be aware of the problems related to the consistency of the two 
> separate letter 'I's in the Turkish alphabet (and the alphabets that 
> are based on the Turkish alphabet).
> 
> Lowercase and uppercase versions of the two are consistent in whether 
> they have a dot or not:
> 
>   http://en.wikipedia.org/wiki/Turkish_I
> 
> Turkish alphabet being in a position so close to the western alphabets, 
> but not close enough, puts it in a strange position. (Strangely; the 
> same applies geographically, politically, socially, etc. as well... ;))
> 
> Computer systems *almost* work for Turkish, but not for those two letters.
> 
> I love the fact that D allows Unicode letters in the source code and 
> that it natively supports Unicode. I cannot stress enough how important 
> this is. That is the single biggest reason why I decided to finally 
> write a programming tutorial. Thank you to all who proposed and 
> implemented those features!
> 
> Back to the Turquois 'I's... What is a programmer to do when writing 
> programs that deal with Turkish letters?
> 
> a) Accept that Phobos too has this age old behavior that is a result of 
> premature optimization (i.e. this code in tolower: c + (cast(char)'a' - 
> 'A'))
> 
> b) Accept that the problem is unsolvable because the letter I has two 
> minuscules, and the letter i has two majuscules anyway, and that the 
> intent is not always clear
> 
> c) Accept Turkish alphabet as being pathological (merely for being in 
> the minority!), and use a Turkish version of Phobos or some other 
> library
> 
> d) Solve the problem with locale support
> 
> Is option d possible with today's systems? Whose responsibility is this 
> anyway? OS? Language? Program? Something else?
> 
> Since alphanumerical ordering is also of interest, I think this 
> has something to do with locales.
> 
> Is there a way for a program to work with Turkish letters and ensure 
> that the following program produces the expected output of 'dotless i', 
> 'I with dot', and 0?
> 
> import std.stdio;
> import std.string;
> import std.c.locale;
> import std.uni;
> 
> void main()
> {
>     const char * result = setlocale(LC_ALL, "tr_TR.UTF-8");
>     assert(result);
> 
>     writeln(toUniLower('I'));
>     writeln(toUniUpper('i'));
>     writeln(indexOf("I",
>                     '\u0131',               // dotless i
>                     (CaseSensitive).no));
> }
> 
> This is a practical question. I really want to be able to work with 
> Turkish... :)

Perhaps this could be of some inspiration. In Cocoa you can pass a 
locale argument to many string methods (unfortunately, not 
lowercaseString or uppercaseString) to get the desired result. For 
instance, the "rangeOfString:options:range:locale:" method can search 
for substrings case-insensitively, and its documentation specifically 
discusses the Turkish “ı” character under the locale parameter.

http://developer.apple.com/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/Reference/NSString.html#//apple_ref/occ/instm/NSString/rangeOfString:options:range:locale:

It's also interesting to see that when you search for ß in a webpage using 
Safari, it also matches every instance of SS (whatever your locale). ß 
is a German character that becomes SS in uppercase.

 - - -

What I'd like to see is a base class representing a locale. Then you 
can instantiate the locale you want (from a config file, by coding it 
directly, having bindings to system APIs, or a mix of all this) and use 
the locale. Something like:

	class Locale
	{
	immutable:
		string lowercase(string s);
		string uppercase(string s);

		int compare(string a, string b);

		// number & date formatting, etc.
	}

	immutable(Locale) systemLocale();              // get default system locale
	immutable(Locale) locale(string localeName); // get best matching locale

	void main()
	{
		// locale() returns immutable(Locale), so the variables are immutable too
		immutable(Locale) turkish = locale("tr-TR");
		writeln(turkish.lowercase("I")); // writes "ı"
		writeln(turkish.uppercase("i")); // writes "İ"

		immutable(Locale) english = locale("en-US");
		writeln(english.lowercase("I")); // writes "i"
		writeln(english.uppercase("i")); // writes "I"

		writeln(systemLocale.lowercase("I")); // depends on user settings
		writeln(systemLocale.uppercase("i")); // depends on user settings
	}

This way you can work with many locales at once. And there's no 
reliance on a global state.
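
To make the idea a bit more concrete, here is a minimal runnable sketch 
of how such locale objects could behave. Everything in it is an 
assumption for illustration, not an existing Phobos API: the Locale and 
TurkishLocale classes and the locale() lookup function are hypothetical, 
the lookup only knows "tr-TR", and it returns const(Locale) rather than 
immutable(Locale) to keep construction simple. The default case mapping 
is delegated to std.uni's toLower and toUpper.

```d
import std.stdio : writeln;
import std.uni : toLower, toUpper;

// Hypothetical base locale: plain, locale-neutral Unicode case mapping.
class Locale
{
    string lowercase(string s) const { return toLower(s); }
    string uppercase(string s) const { return toUpper(s); }
}

// Hypothetical Turkish locale: only the dotted/dotless 'I' pairs differ
// from the default mapping.
class TurkishLocale : Locale
{
    private static dchar lowerChar(dchar c)
    {
        if (c == 'I') return '\u0131';      // I -> dotless ı
        if (c == '\u0130') return 'i';      // İ -> dotted i
        return toLower(c);
    }

    private static dchar upperChar(dchar c)
    {
        if (c == 'i') return '\u0130';      // i -> dotted İ
        if (c == '\u0131') return 'I';      // ı -> dotless I
        return toUpper(c);
    }

    override string lowercase(string s) const
    {
        string result;
        foreach (dchar c; s)                // decode UTF-8 one code point at a time
            result ~= lowerChar(c);         // appending a dchar re-encodes to UTF-8
        return result;
    }

    override string uppercase(string s) const
    {
        string result;
        foreach (dchar c; s)
            result ~= upperChar(c);
        return result;
    }
}

// Hypothetical lookup: a real one would consult system APIs or config data.
const(Locale) locale(string localeName)
{
    if (localeName == "tr-TR")
        return new TurkishLocale;
    return new Locale;                      // fall back to the neutral mapping
}

void main()
{
    auto turkish = locale("tr-TR");
    writeln(turkish.lowercase("I"));        // writes "ı"
    writeln(turkish.uppercase("i"));        // writes "İ"

    auto english = locale("en-US");
    writeln(english.lowercase("I"));        // writes "i"
    writeln(english.uppercase("i"));        // writes "I"
}
```

Since each locale is just an object, the Turkish and English mappings 
coexist in the same program and nothing depends on setlocale().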

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/