std.string and unicode

Jari-Matti Mäkelä jmjmak at utu.fi.invalid
Sat Dec 16 18:59:21 PST 2006


Frits van Bommel wrote:
> Jari-Matti Mäkelä wrote:
>> Frits van Bommel wrote:
>>> Todor Totev wrote:
>>>> Hello all,
>>>> are std.string functions supposed to be UNICODE aware?
>>> Yes.
>>
>> No they are not.
> 
> All I said was they're _supposed_ to be Unicode aware, which was the
> question ;). I also said to report any that aren't.
> 

Ok, sorry - didn't mean to be rude. I just thought it might not be
realistic to expect them all to be Unicode aware in the near future.

Last time I checked, many of them were implemented for ASCII only, and
they haven't changed much in the last few years. The 1.0 stable release
is due on Jan 1, and I cannot foretell how much they will change before
or after that. For example, Ruby has been out for a while and some of
its string functions are still not compatible with Unicode.

> But functions like find() (substring form) and replace(), for instance,
> shouldn't need special code to deal with UTF-8 vs. ASCII. In fact,
> anything that doesn't deal with properties of individual characters
> should probably be fine.

Unless the functions must be aware of different forms of the same
characters (for example, a precomposed letter versus a base letter
followed by a combining mark). The Unicode gurus can shed more light on
this. I recall there has been previous discussion of this also.
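
To illustrate what I mean by different forms of the same character, here
is a rough sketch. It is in Python rather than D, only because its
standard library exposes normalization directly. The helper name
find_normalized is made up for this example, and nothing here reflects
how std.string works.

```python
import unicodedata

def find_normalized(haystack: str, needle: str) -> int:
    """Substring search after normalizing both strings to NFC.

    Hypothetical helper for illustration; returns an index into the
    normalized haystack, or -1 if the needle is absent.
    """
    h = unicodedata.normalize("NFC", haystack)
    n = unicodedata.normalize("NFC", needle)
    return h.find(n)

precomposed = "M\u00e4kel\u00e4"    # ä as one code point (U+00E4)
decomposed = "Ma\u0308kela\u0308"   # a followed by combining diaeresis (U+0308)

# Same text to a reader, different code point sequences:
print(precomposed == decomposed)               # False
print(decomposed.find("\u00e4"))               # -1, a plain search misses it
print(find_normalized(decomposed, "\u00e4"))   # 1
```

So a find() that compares code units only is correct for UTF-8 as long
as both strings use the same form, and misses matches when they don't.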

> Actually, I just did look through the entire source. The functions that
> seem to have trouble with UTF-8:
> 
> icmp: Only checks for range 'A'-'Z'
> ifind, irfind (substring forms): use icmp and are thus equally guilty.
> rjustify, ljustify, center, zfill: fail to account for multi-byte
> characters still being only one character (i.e. align according to byte
> length, not character length)
> maketrans, translate: Not really practical for full Unicode, but buggy
> still.
> soundex: AFAIK the Soundex algorithm is undefined for characters out of
> the ranges a-z and A-Z, so the fault lies more in the algorithm than
> this implementation of it.
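
The alignment problem in that list is easy to reproduce. A minimal
sketch, in Python rather than D and not taken from the Phobos source:
rjustify_bytes is a made-up helper that pads by UTF-8 byte count, the
way a byte-oriented routine would, and rjustify_chars is a made-up
helper that pads by code point count.

```python
def rjustify_bytes(s: str, width: int) -> str:
    """Pad on the left, counting UTF-8 bytes (mimics a byte-oriented routine)."""
    pad = max(0, width - len(s.encode("utf-8")))
    return " " * pad + s

def rjustify_chars(s: str, width: int) -> str:
    """Pad on the left, counting code points."""
    pad = max(0, width - len(s))
    return " " * pad + s

word = "\u00e4\u00f6"   # "äö": 2 code points, 4 bytes in UTF-8

print(repr(rjustify_bytes(word, 6)))   # '  äö'   only 4 columns wide
print(repr(rjustify_chars(word, 6)))   # '    äö' 6 columns wide
```

Counting code points is itself only an approximation, since combining
marks and wide characters would need more care.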

Also tolower & toupper have problems with characters like Ә (04D8) and Һ
(04BA). Ok, I'm not 100% sure they are supposed to have upper and lower
case counterparts. Then there are characters like Σ (03A3) that are used
both as symbols, e.g. in mathematics, and as ordinary letters in some
languages.
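
For what it's worth, the Unicode character database does list lowercase
counterparts for both of those Cyrillic letters, and Σ has two lowercase
forms depending on its position in a word. A quick check, in Python
rather than D (its str.lower follows the Unicode case mappings), just to
show what a fully Unicode-aware tolower would be expected to return:

```python
import unicodedata

# Print each capital letter's name and the code point it lowercases to.
for ch in ("\u04d8", "\u04ba", "\u03a3"):
    lower = ch.lower()
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{ord(lower):04X}")

# Σ lowercases to σ (U+03C3) in general, but to ς (U+03C2) at the end of a word.
print("\u03a3".lower() == "\u03c3")                           # True
print("\u039f\u0394\u039f\u03a3".lower().endswith("\u03c2"))  # True
```

The final sigma case shows that a per-character lookup table is not
quite enough: the right answer can depend on the neighbouring letters.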

Anyway, one cannot tell those functions which locale to use in
conversions, so I thought they are not supposed to be Unicode compatible.
And the spec does not say whether they ought to be compatible or not.
(OTOH I'm pretty impressed by the capabilities of tolower/toupper. Last
time I tested them they were not even able to change the case of ä:s
and ö:s.)
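
The locale issue shows up with the Turkish dotted and dotless i. A small
sketch, again in Python and not D. turkish_lower is a hypothetical
helper written for this example, not an existing library function, and
it special-cases only the two letters that differ.

```python
def turkish_lower(s: str) -> str:
    """Lowercase using Turkish rules for I and İ (hypothetical helper)."""
    # In Turkish, I (U+0049) pairs with ı (U+0131) and İ (U+0130) with i.
    return s.replace("I", "\u0131").replace("\u0130", "i").lower()

word = "I\u015eIK"           # "IŞIK"

print(word.lower())          # işik, the locale-independent default
print(turkish_lower(word))   # ışık, what a Turkish reader expects
```

Without a locale parameter, a tolower can only ever give the first
answer.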

I'm most probably the wrong person to answer this, but this is just the
first impression I got. Now that you mention it, it really seems that
Walter has improved those a lot. I'm not sure how much it matters to the
OP how things are supposed to be if they are not implemented yet. I'm
still using D for small scale projects that reach the end of their life
cycle before the spec / std library manages to get through such massive
changes.



More information about the Digitalmars-d mailing list