std.string and unicode

Sat Dec 16 17:41:24 PST 2006

Jari-Matti Mäkelä wrote:
> Frits van Bommel wrote:
>> Todor Totev wrote:
>>> Hello all,
>>> are std.string functions supposed to be UNICODE aware?
>> Yes.
> 
> No they are not.

All I said was they're _supposed_ to be Unicode aware, which was the 
question ;). I also said to report any that aren't.

> The string constants (lowercase, letters, etc.) only
> feature ASCII characters. Some functions have "BUG: only works with
> ASCII" attached to them. Many of the functions expect that the char[]
> string consists of 8 bit characters.

First, let me note I was mostly just repeating what I remember from a 
previous iteration of this discussion. I didn't actually look through 
the entire source.
I guess those with "BUG" in them are known bugs, which will hopefully be 
fixed.
But functions like find() (substring form) and replace(), for instance, 
shouldn't need special code to deal with UTF-8 vs. ASCII. In fact, 
anything that doesn't deal with properties of individual characters 
should probably be fine.
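
As a rough illustration of why substring search needs no UTF-8-specific code, here is a minimal sketch in current D, which may differ from the Phobos version discussed here. `byteFind` is a hypothetical helper, not the actual std.string find(). It compares raw bytes only, and still never matches in the middle of a character, because in valid UTF-8 the bytes of a multi-byte character are all >= 0x80 and lead bytes are distinct from continuation bytes.

```d
// Hypothetical byte-wise substring search; no decoding is performed.
ptrdiff_t byteFind(const(char)[] haystack, const(char)[] needle)
{
    if (needle.length > haystack.length)
        return -1;
    foreach (i; 0 .. haystack.length - needle.length + 1)
    {
        // Plain slice comparison, one code unit (byte) at a time.
        if (haystack[i .. i + needle.length] == needle)
            return cast(ptrdiff_t) i;
    }
    return -1;
}

void main()
{
    string s = "Mäkelä wrote";          // each 'ä' is two bytes in UTF-8
    assert(byteFind(s, "wrote") == 9);  // result is a byte index
    assert(byteFind(s, "kelä") == 3);   // non-ASCII needle works too
    assert(byteFind(s, "a") == -1);     // ASCII 'a' never matches inside 'ä'
}
```

Note that the returned index counts bytes, not characters, so it is only meaningful for slicing the same string.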

> Mango project (http://dsource.org/projects/mango) has functions with
> better Unicode compatibility.

Always good to know if I ever need it.

...

[some time later]
Actually, I just did look through the entire source. The functions that 
seem to have trouble with UTF-8:

icmp: Only checks for range 'A'-'Z'
ifind, irfind (substring forms): use icmp and are thus equally guilty.
rjustify, ljustify, center, zfill: pad according to byte length rather 
than character count, so each multi-byte character is counted as several 
characters instead of one
maketrans, translate: Not really practical for full Unicode, but buggy 
still.
soundex: AFAIK the Soundex algorithm is undefined for characters out of 
the ranges a-z and A-Z, so the fault lies more in the algorithm than 
this implementation of it.
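
The first two kinds of problem in the list above can be sketched as follows. This is illustrative code in current D, not the Phobos source: `asciiOnlyEqual` and `padLeft` are hypothetical helpers written to mimic the behaviour described, and `walkLength` (from std.range) is used here to count code points, which is itself only an approximation of "characters" once combining marks are involved.

```d
import std.range : walkLength;

// Hypothetical comparison that folds case only in 'A'..'Z'.
bool asciiOnlyEqual(const(char)[] a, const(char)[] b)
{
    if (a.length != b.length)
        return false;
    foreach (i; 0 .. a.length)
    {
        char x = a[i], y = b[i];
        if (x >= 'A' && x <= 'Z') x = cast(char)(x + ('a' - 'A'));
        if (y >= 'A' && y <= 'Z') y = cast(char)(y + ('a' - 'A'));
        if (x != y)
            return false;
    }
    return true;
}

// Hypothetical right-justify: `measured` is the length the caller
// believes the string has (bytes or code points).
string padLeft(string s, size_t width, size_t measured)
{
    if (measured >= width)
        return s;
    auto pad = new char[](width - measured);
    pad[] = ' ';
    return pad.idup ~ s;
}

void main()
{
    // ASCII-only folding works for ASCII but misses 'Ä' vs 'ä'.
    assert(asciiOnlyEqual("MAKELA", "makela"));
    assert(!asciiOnlyEqual("MÄKELÄ", "mäkelä"));

    string s = "Mäkelä";
    assert(s.length == 8);      // bytes
    assert(s.walkLength == 6);  // code points

    // Padding by byte length comes out two columns short.
    assert(padLeft(s, 10, s.length).walkLength == 8);
    assert(padLeft(s, 10, s.walkLength).walkLength == 10);
}
```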

The rest seems to handle it just fine, as long as you only pass valid 
UTF-8 and any indexes passed in are at character boundaries.
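
To make the "character boundaries" condition concrete, here is a small sketch in current D. `isBoundary` is a hypothetical helper; `validate` and `UTFException` are from std.utf as it exists today, which may not match the library version discussed here. An index is at a boundary when the byte there is not a continuation byte (bit pattern 10xxxxxx), and slicing anywhere else produces invalid UTF-8.

```d
import std.exception : assertNotThrown, assertThrown;
import std.utf : UTFException, validate;

// Hypothetical check: true if index i does not point into the
// middle of a multi-byte sequence.
bool isBoundary(const(char)[] s, size_t i)
{
    return i == s.length || (i < s.length && (s[i] & 0xC0) != 0x80);
}

void main()
{
    string s = "Mäkelä";        // 'ä' occupies bytes 1-2 and 6-7
    assert(isBoundary(s, 1));   // lead byte of the first 'ä'
    assert(!isBoundary(s, 2));  // continuation byte
    assert(isBoundary(s, 3));   // 'k'

    // A slice ending on a boundary is still valid UTF-8 ...
    assertNotThrown!UTFException(validate(s[0 .. 3]));
    // ... while one that cuts 'ä' in half is not.
    assertThrown!UTFException(validate(s[0 .. 2]));
}
```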

That's a score of 10 ASCII-only functions against about 70 that handle 
UTF-8. I think the phrase 'many of the functions' is a bit exaggerated 
here (especially since the last 7 are a bit obscure IMHO), but YMMV.

Note that I didn't follow my own advice and actually test these; this is 
based purely on browsing through the file.