Ceci n'est pas une char
Sean Kelly
sean at f4.ca
Thu Apr 6 12:02:27 PDT 2006
Mike Capp wrote:
> (Changing subject line since we seem to have rudely hijacked the OP's topic)
>
> In article <e13b56$is0$1 at digitaldaemon.com>,
> Anders F Björklund says...
>> James Dunne wrote:
>>
>>> The char type is really a misnomer for dealing with UTF-8 encoded
>>> strings. It should be named closer to "code-unit for UTF-8 encoding".
>
> (I fully agree with this statement, by the way.)
>
>> Yeah, but it does hold an *ASCII* character ?
>
> I don't find that very helpful - seeing a char[] in code doesn't tell me
> anything about whether it's byte-per-character ASCII or possibly-multibyte
> UTF-8.
Since UTF-8 is compatible with ASCII, might it not be reasonable to
assume char strings are always UTF-8? I'll admit this suggests many of
the D string functions are broken, but they can certainly be fixed.

I've been considering rewriting find and rfind to support multibyte
strings. Fixing find is pretty straightforward, though rfind might be a
tad messy. As a related question, can anyone verify whether
std.utf.stride will return a correct result for evaluating an arbitrary
offset in all potential input strings?
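
For illustration, here is a minimal sketch of what code-point-aware find
and rfind might look like, written in Python over raw UTF-8 bytes rather
than D's char[]. The names stride, find_char and rfind_char are
hypothetical. This stride is not std.utf.stride and says nothing about
how that function behaves; it only shows that a length derived from a
lead byte has no meaning at an offset that lands on a continuation byte,
so the sketch rejects such offsets. It assumes the input is already valid
UTF-8 and does not check for overlong or truncated sequences.

```python
def is_continuation(b: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (b & 0xC0) == 0x80


def stride(data: bytes, i: int) -> int:
    """Length in bytes of the sequence whose lead byte is data[i].

    Raises ValueError if data[i] cannot start a sequence, which is the
    case for any offset that falls in the middle of a character.
    """
    b = data[i]
    if b < 0x80:
        return 1
    if (b & 0xE0) == 0xC0:
        return 2
    if (b & 0xF0) == 0xE0:
        return 3
    if (b & 0xF8) == 0xF0:
        return 4
    raise ValueError(f"offset {i} is not at the start of a sequence")


def find_char(data: bytes, ch: str) -> int:
    """Byte offset of the first occurrence of ch, or -1."""
    needle = ch.encode("utf-8")
    i = 0
    while i < len(data):
        n = stride(data, i)
        if data[i:i + n] == needle:
            return i
        i += n  # step forward one whole character
    return -1


def rfind_char(data: bytes, ch: str) -> int:
    """Byte offset of the last occurrence of ch, or -1."""
    needle = ch.encode("utf-8")
    i = len(data)
    while i > 0:
        i -= 1
        # Back up over continuation bytes to reach the lead byte.
        while i > 0 and is_continuation(data[i]):
            i -= 1
        if data[i:i + stride(data, i)] == needle:
            return i
    return -1
```

Both functions return byte offsets, not character indices, which is what
a caller would need in order to slice the underlying array. The backward
walk is the "messy" part: the lead byte has to be located before the
length of the sequence is known.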
>> For the general case, UTF-32 is a pretty wasteful
>> Unicode encoding just to have that privilege ?
>
> I'm not sure there is a "general case", so it's hard to say. Some programmers
> have to deal with MBCS every day; others can go for years without ever having to
> worry about anything but vanilla ASCII.
>
> "Wasteful" is also relative. UTF-32 is certainly wasteful of memory space, but
> UTF-8 is potentially far more wasteful of CPU cycles and memory bandwidth.
> Finding the millionth character in a UTF-8 string means looping through at least
> a million bytes, and executing some conditional logic for each one. Finding the
> millionth character in a UTF-32 string is a simple pointer offset and one-word
> fetch.
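
The indexing cost described in the quoted paragraph can be sketched as
follows. This is an illustration in Python, not D library code; the
function names are hypothetical, "character" here means a Unicode code
point, and both inputs are assumed to be valid (UTF-32 little-endian
with no byte order mark).

```python
def nth_char_utf8(data: bytes, n: int) -> str:
    """Return code point number n (0-based) by scanning from the start."""
    seen = 0
    for i, b in enumerate(data):
        if (b & 0xC0) == 0x80:
            continue  # continuation byte, not the start of a character
        if seen == n:
            # Extend over the continuation bytes that follow the lead byte.
            j = i + 1
            while j < len(data) and (data[j] & 0xC0) == 0x80:
                j += 1
            return data[i:j].decode("utf-8")
        seen += 1
    raise IndexError(n)


def nth_char_utf32(data: bytes, n: int) -> str:
    """Return code point number n (0-based) by direct offset."""
    if n < 0 or 4 * n + 4 > len(data):
        raise IndexError(n)
    return data[4 * n:4 * n + 4].decode("utf-32-le")
```

The UTF-8 version inspects every byte before the target, while the
UTF-32 version computes one offset. The trade is time against the four
bytes per code point that UTF-32 always spends.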
For what it's worth, I believe the correct behavior for string/array
operations is to provide overloads for char[] and wchar[] that require
input to be valid UTF-8 and UTF-16, respectively. If the user knows
their data is pure ASCII or they otherwise want to process it as a
fixed-width string they can cast to ubyte[] or ushort[]. This is what
I'm planning for std.array in Ares.
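
A rough sketch of that split, using reversal as the example operation.
This is Python standing in for D, and reverse_utf8 and reverse_units are
hypothetical names, not the planned Ares API. The first function plays
the role of the char[] overload and insists on valid UTF-8; the second
plays the role of the ubyte[] path, where the caller has opted into
fixed-width processing.

```python
def reverse_utf8(data: bytes) -> bytes:
    """String path: input must be valid UTF-8; reverses code points.

    Raises UnicodeDecodeError if the input is not valid UTF-8.
    """
    return data.decode("utf-8")[::-1].encode("utf-8")


def reverse_units(data: bytes) -> bytes:
    """Fixed-width path: reverses code units with no validation."""
    return data[::-1]
```

For pure ASCII the two give the same answer. For multibyte input only
the first keeps the result valid: reverse_units applied to the UTF-8
encoding of "aé" puts the continuation byte ahead of its lead byte.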
Sean