std.string and unicode
Frits van Bommel
fvbommel at REMwOVExCAPSs.nl
Sat Dec 16 17:41:24 PST 2006
Jari-Matti Mäkelä wrote:
> Frits van Bommel wrote:
>> Todor Totev wrote:
>>> Hello all,
>>> are std.string functions supposed to be UNICODE aware?
>> Yes.
>
> No they are not.
All I said was they're _supposed_ to be Unicode aware, which was the
question ;). I also said to report any that aren't.
> The string constants (lowercase, letters, etc.) only
> feature ASCII characters. Some functions have "BUG: only works with
> ASCII" attached to them. Many of the functions expect that the char[]
> string consists of 8 bit characters.
First, let me note I was mostly just repeating what I remember from a
previous iteration of this discussion. I didn't actually look through
the entire source.
I guess those with "BUG" in them are known bugs, which will hopefully be
fixed.
But functions like find() (substring form) and replace(), for instance,
shouldn't need special code to deal with UTF-8 vs. ASCII. In fact,
anything that doesn't deal with properties of individual characters
should probably be fine.
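
To make the find()/replace() point concrete, here is a small sketch. It is in Python rather than D, with Python's `bytes` standing in for a char[] holding UTF-8; that stand-in and the sample strings are my own assumptions, not Phobos code. It relies on one property of UTF-8: a lead byte never looks like a continuation byte, so a valid needle can't match starting in the middle of a character.

```python
# Byte-wise substring search and replace on UTF-8 data.
# Python `bytes` stands in for D's char[] (an assumption made for illustration).

text = "Jari-Matti Mäkelä wrote".encode("utf-8")
needle = "Mäkelä".encode("utf-8")

# A plain byte-for-byte search with no knowledge of the encoding.
pos = text.find(needle)

# The match starts on a character boundary, so the bytes before it
# still decode cleanly as UTF-8.
prefix = text[:pos].decode("utf-8")

# A byte-wise replace likewise produces valid UTF-8.
replaced = text.replace(needle, "Makela".encode("utf-8")).decode("utf-8")

print(pos, repr(prefix), replaced)
```

Note that `pos` is a byte offset, not a character count; that only matters once non-ASCII text precedes the match.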
> Mango project (http://dsource.org/projects/mango) has functions with
> better Unicode compatibility.
Always good to know if I ever need it.
...
[some time later]
Actually, I just did look through the entire source. The functions that
seem to have trouble with UTF-8:
icmp: Only checks for range 'A'-'Z'
ifind, irfind (substring forms): use icmp and are thus equally guilty.
rjustify, ljustify, center, zfill: pad according to byte length rather
than character length, so a multi-byte character is counted as several
characters instead of one.
maketrans, translate: Not really practical for full Unicode, but buggy
still.
soundex: AFAIK the Soundex algorithm is undefined for characters outside
the ranges a-z and A-Z, so the fault lies more in the algorithm than
this implementation of it.
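
The icmp and justify problems listed above can be reproduced with a rough sketch like the following. It is Python, not D, and the helper names (ascii_icmp, rjustify_bytes, rjustify_chars) are hypothetical stand-ins that mimic the behaviour described, not the actual std.string source.

```python
def ascii_icmp(a: bytes, b: bytes) -> int:
    """Case-insensitive compare that folds only the bytes 'A'-'Z'.

    Hypothetical stand-in for an ASCII-only icmp; returns -1, 0 or 1.
    """
    def fold(s: bytes) -> bytes:
        # Add 32 to bytes in 'A'..'Z' (0x41..0x5A); leave all others alone.
        return bytes(c + 32 if 0x41 <= c <= 0x5A else c for c in s)

    fa, fb = fold(a), fold(b)
    return (fa > fb) - (fa < fb)


def rjustify_bytes(s: bytes, width: int) -> bytes:
    """Left-pad with spaces, measuring the string in bytes."""
    return b" " * max(0, width - len(s)) + s


def rjustify_chars(s: bytes, width: int) -> bytes:
    """Left-pad with spaces, measuring the string in code points."""
    n = len(s.decode("utf-8"))
    return b" " * max(0, width - n) + s


upper = "MÄKELÄ".encode("utf-8")
lower = "mäkelä".encode("utf-8")

# ASCII letters fold correctly, but 'Ä' and 'ä' are left as different bytes.
print(ascii_icmp(b"ABC", b"abc"), ascii_icmp(upper, lower))

# "Mäkelä" is 6 characters but 8 bytes, so the byte-based version adds no
# padding for width 8 while the character-based one adds two spaces.
print(repr(rjustify_bytes(lower, 8).decode("utf-8")))
print(repr(rjustify_chars(lower, 8).decode("utf-8")))
```

Counting code points is itself only an approximation of display width (combining characters, for instance, don't take up a column), but it is closer than counting bytes.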
The rest seem to handle UTF-8 just fine, as long as you only pass valid
UTF-8 and any indexes passed in are at character boundaries.
That's a score of 10 ASCII-only functions against about 70 that are
UTF-8 safe. I think the phrase 'many of the functions' is a bit
exaggerated here (especially since the last 7 are a bit obscure IMHO),
but YMMV.
Note that I didn't follow my own advice and actually test these; this is
based purely on browsing through the file.
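
On the "indexes at character boundaries" caveat above: a minimal sketch of what goes wrong otherwise, again in Python with `bytes` standing in for char[]. The helpers is_boundary and try_decode are hypothetical names made up for this example.

```python
s = "zäh".encode("utf-8")  # bytes: 7A C3 A4 68


def is_boundary(data: bytes, i: int) -> bool:
    """True if index i (0..len) does not point at a continuation byte.

    UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    """
    return i == len(data) or (data[i] & 0xC0) != 0x80


def try_decode(data: bytes):
    """Return the decoded text, or None if data is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Index 2 falls inside the two-byte sequence for 'ä'; index 3 does not.
print(is_boundary(s, 2), is_boundary(s, 3))

# Slicing at the bad index cuts 'ä' in half and leaves invalid UTF-8.
print(try_decode(s[:2]), try_decode(s[:3]))
```

So a function that just slices at whatever index it is given is only as correct as the index the caller passes in.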