std.string will get the boot

Ali Çehreli acehreli at yahoo.com
Fri Jan 29 16:00:17 PST 2010


Jacob Carlborg wrote:
 > On 1/29/10 22:18, Ali Çehreli wrote:
 >> Jacob Carlborg wrote:
 >>
 >>  > I would keep std.string for string specific functions and perhaps
 >>  > publicly import std.algorithm. For example, functions like tolower,
 >>  > icmp and toStringz.
 >>
 >> I've been thinking about characters lately and have realized that
 >> tolower, toupper, icmp, and friends should not be in a string library.
 >> Those functions need an "alphabet" to be useful; not language, nor
 >> locale...
 >>
 >> In fact, the character itself must have alphabet information. Otherwise
 >> a string like "ali & jim" cannot be converted to upper-case correctly(*)
 >> as "ALİ & JIM". And the word "correctly" there depends on each
 >> character's alphabet.
 >>
 >> Similarly, two characters that look the same cannot be compared for
 >> ordering. Comparing the 'x' of one alphabet to the 'x' of another
 >> alphabet is a meaningless operation.
 >>
 >> Ali
 >
 > I'm not sure I really understand this, probably because I don't know
 > much about how Unicode works. I'm thinking out loud:
 >
 > If "i", as you have in "ali", has the corresponding "İ" as upper case,
 > wouldn't that be a different character from the English "i"?

The 'i' in Turkish text and the 'i' in English text are the same
"character", because they have the same ASCII and Unicode value in both
alphabets. But they are not the same "letter" when they are part of
different text.

The iİ (and ıI) issue is probably too special. A number of Turkic alphabets
chose ASCII 'i', probably for historical reasons. Unicode did not define
a separate code point for that 'i' either, probably because those alphabets
were already using the ASCII 'i'.
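
As a rough sketch of what per-alphabet case conversion could look like
(in Python rather than D, purely for illustration; the names
ENGLISH_UPPER, TURKISH_UPPER and upper_runs are hypothetical and not part
of any library), each run of text carries its own mapping:

```python
# Hypothetical per-alphabet overrides; anything not listed falls back to
# the default Unicode mapping that str.upper() applies.
ENGLISH_UPPER = {}
TURKISH_UPPER = {"i": "\u0130", "\u0131": "I"}  # i -> İ, ı -> I


def upper_runs(runs):
    """Upper-case a list of (text, overrides) pairs, one alphabet per run."""
    out = []
    for text, overrides in runs:
        out.append("".join(overrides.get(ch, ch.upper()) for ch in text))
    return "".join(out)


# Without alphabet information every 'i' becomes 'I'.
plain = "ali & jim".upper()

# With each run tagged, "ali" follows the Turkish mapping and "jim" the
# English one.
tagged = upper_runs([("ali", TURKISH_UPPER), (" & jim", ENGLISH_UPPER)])

print(plain)   # ALI & JIM
print(tagged)  # ALİ & JIM
```

The point of the sketch is only that the string alone does not carry
enough information; the mapping has to come from somewhere else.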

 > If so, I'm not
 > sure I see the problem. If not, I see the problem.

The letter 'i' (and I) is special, but the issue applies to any other
letter: Is it valid to compare an 'i' in English text to an 'i' in
German text?

I think it's only valid at the lowest data representation level. And 
ASCII never claims to be more than a code table for "information 
interchange". That part is fine.

The problem is with the use of certain ranges of the ASCII table as the 
English alphabet. It is unfortunate that it works... :)

It is great that D supports three separate Unicode encodings in the
language, but encodings are at a lower level of abstraction than
"letters". I am not sure what data is used for toUniUpper and toUniLower
in std.uni, but they can't work correctly without alphabet information.
They favor the ASCII layout, probably for historical reasons.

I think the problems with using the ASCII table for sorting are well
known. A more interesting example is the Azeri alphabet: it uses
the ASCII xX characters, but sorts them after hH.
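
A minimal sketch of alphabet-aware ordering, again in Python for
illustration only. The letter order in AZERI_LOWER is my assumption of
the Azeri Latin alphabet, azeri_key is a hypothetical helper, and the
sample strings are just test data:

```python
# Assumed order of the Azeri Latin alphabet (lower case); note 'x' right
# after 'h'.
AZERI_LOWER = "abcçdeəfgğhxıijkqlmnoöprsştuüvyz"
RANK = {ch: i for i, ch in enumerate(AZERI_LOWER)}


def azeri_key(word):
    """Map each letter to its rank in the alphabet; unknown characters
    sort after all known letters, in code point order."""
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word]


words = ["xal", "ad", "iz", "hal"]

# Plain code point order puts 'x' near the end, as in English.
code_point_order = sorted(words)

# Alphabet-aware order puts 'x' directly after 'h'.
azeri_order = sorted(words, key=azeri_key)

print(code_point_order)  # ['ad', 'hal', 'iz', 'xal']
print(azeri_order)       # ['ad', 'hal', 'xal', 'iz']
```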

Ali


