Table of strings sorting problem

Sat Mar 11 00:05:37 PST 2006

Hasan Aljudy wrote:
> S. Chancellor wrote:
> 
>> On 2006-03-10 17:20:35 -0800, Aarti <aarti at interia.pl> said:
>>
>>> Hello all D-Fans!
>>>
>>> I encountered a problem with string sorting according to Polish 
>>> language rules. Here is a simple test program:
>>>
>>> // ----------------------------------
>>> import std.stdio;
>>> void main() {
>>>     char[][] table;
>>>     table.length=15;
>>>     table[0]="ą";
>>>     table[1]="a";
>>>     table[2]="ć";
>>>     table[3]="c";
>>>     table[4]="ę";
>>>     table[5]="e";
>>>     table[6]="ł";
>>>     table[7]="l";
>>>     table[8]="ó";
>>>     table[9]="o";
>>>     table[10]="ś";
>>>     table[11]="s";
>>>     table[12]="ź";
>>>     table[13]="ż";
>>>     table[14]="z";
>>>
>>>     table.sort;
>>>
>>>     foreach(char[] s; table) {
>>>         writef(s);
>>>     }
>>>     writefln();
>>> }
>>> // ----------------------------------
>>>
>>> Output of this test is:
>>> aceloszóąćęłśźż
>>>
>>> when it should be:
>>> aącćeęlłoósśzźż
>>>
>>> It looks like sort doesn't sort properly according to language rules.
>>>
>>> Is it a known issue? How do I sort strings in D according to language 
>>> rules?
>>>
>>> PS. Being able to use Polish characters in class identifiers is 
>>> really cool. In C++ books the examples always use Trojkat instead of 
>>> Trójkąt (triangle), and it looks awful.
>>>
>>> Regards
>>> Marcin Kuszczak
>>
>>
>>
>> Sort works off the binary value of each character. A sort that 
>> follows Polish language rules would have to be implemented manually 
>> by you: you would need to specify a map from each character to its 
>> sort order and sort based on that. I'm not sure if the sort property 
>> takes a delegate; that was something that was proposed before. You 
>> could say it's mostly coincidence that the Latin characters fall in 
>> order numerically. (It was probably done on purpose by whoever 
>> decided the ASCII character values, though.)
>>
>> -S.
>>
> 
> And note that the output
>  >> aceloszóąćęłśźż
> prints "English" characters first!! acelosz

Correction: ASCII characters first, because they are in the range 
0-127 and every accented letter has a code point above that. Look at the 
Unicode tables; they're publicly available. Other languages written in 
the Latin script use the ASCII characters too.
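
To make the ordering concrete, here is a small sketch. It assumes a current D2 compiler with Phobos rather than the compiler used for the program quoted above, and sortByCodeUnits is just a name I made up for illustration. The default sort compares UTF-8 code units numerically, so anything below 128 lands ahead of every accented letter:

```d
import std.algorithm : sort;
import std.stdio : writefln, writeln;

/// Hypothetical helper: returns a copy of `words` in the default order,
/// which compares the strings' UTF-8 code units numerically.
string[] sortByCodeUnits(string[] words)
{
    auto copy = words.dup;
    sort(copy);
    return copy;
}

void main()
{
    // The ASCII letters a, o, z come out ahead of ó, ą, ż.
    writeln(sortByCodeUnits(["ą", "a", "ż", "z", "ó", "o"]));

    // Show the code point that drives the ordering.
    foreach (dchar c; "aozóąż")
        writefln("%s  U+%04X", c, cast(uint) c);
}
```

The code points printed for a, o and z are all below U+0080, while ó, ą and ż are above it, which is exactly the split seen in the output quoted above.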

The problem is language- and culture-specific collation.  It is a very 
difficult problem to solve generically, since each language has many 
subcultures and each subculture agrees on different rules for collating 
text.  See discussions on ICU in the archives.

If one is looking for an explanation of the problem along with a 
collation solution, I would recommend: http://www.unicode.org/reports/tr10/
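
For the narrow case in this thread, the character-to-sort-order map suggested earlier can be sketched roughly as follows. This is only an illustration, not the algorithm from the report above: it assumes a current D2 compiler with Phobos, handles lowercase letters only, and the names polishAlphabet, rank and polishLess are made up for the example. The alphabet string is the usual Polish order with q, v and x slotted in at their Latin positions.

```d
import std.algorithm : sort;
import std.conv : to;
import std.stdio : write, writeln;

// Assumed lowercase Polish alphabet order (q, v, x added for loanwords).
immutable dstring polishAlphabet = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"d;

/// Hypothetical helper: position of `c` in the alphabet above.
/// Characters outside it sort after all listed letters, by code point.
size_t rank(dchar c)
{
    foreach (i, dchar letter; polishAlphabet)
        if (letter == c)
            return i;
    return polishAlphabet.length + cast(size_t) c;
}

/// Hypothetical comparison: true if `a` sorts before `b` under `rank`.
bool polishLess(string a, string b)
{
    // Decode to UTF-32 so each index is one whole character.
    auto da = a.to!dstring;
    auto db = b.to!dstring;
    size_t n = da.length < db.length ? da.length : db.length;
    foreach (i; 0 .. n)
    {
        size_t ra = rank(da[i]);
        size_t rb = rank(db[i]);
        if (ra != rb)
            return ra < rb;
    }
    // Equal up to the shorter length: the shorter string goes first.
    return da.length < db.length;
}

void main()
{
    string[] table = ["ą", "a", "ć", "c", "ę", "e", "ł", "l",
                      "ó", "o", "ś", "s", "ź", "ż", "z"];
    sort!polishLess(table);
    foreach (s; table)
        write(s);
    writeln(); // prints aącćeęlłoósśzźż
}
```

A single-level map like this ignores case, multi-level weights, and any locale variation, which is where a real collation library earns its keep.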

-- 
Regards,
James Dunne