Ceci n'est pas une char

Sean Kelly sean at f4.ca
Thu Apr 6 12:02:27 PDT 2006


Mike Capp wrote:
> (Changing subject line since we seem to have rudely hijacked the OP's topic)
> 
> In article <e13b56$is0$1 at digitaldaemon.com>,
> Anders F Björklund says...
>> James Dunne wrote:
>>
>>> The char type is really a misnomer for dealing with UTF-8 encoded 
>>> strings.  It should be named closer to "code-unit for UTF-8 encoding". 
> 
> (I fully agree with this statement, by the way.)
> 
>> Yeah, but it does hold an *ASCII* character?
> 
> I don't find that very helpful - seeing a char[] in code doesn't tell me
> anything about whether it's byte-per-character ASCII or possibly-multibyte
> UTF-8.

Since UTF-8 is backward compatible with ASCII, might it not be 
reasonable to assume char strings are always UTF-8?  I'll admit this 
implies that many of the D string functions are currently broken for 
multibyte input, but they can certainly be fixed.  I've been 
considering rewriting find and rfind to support multibyte strings. 
Fixing find is pretty straightforward, though rfind might be a tad 
messy.  As a related question, can anyone verify whether 
std.utf.stride returns a correct result for an arbitrary offset into 
any potential input string (including an offset that falls in the 
middle of a multibyte sequence)?
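
For the sake of discussion, here is roughly the shape I have in mind 
for a UTF-aware find.  The name and signature below are purely 
illustrative, not what will actually ship:

import std.utf;

// Illustrative sketch: return the byte index of the first occurrence
// of the code point c in the UTF-8 string s, or -1 if it is not
// present.  The loop advances one whole code point at a time, so
// multibyte sequences are never split.
int findCodePoint(char[] s, dchar c)
{
    size_t i = 0;
    while (i < s.length)
    {
        size_t next = i;
        dchar d = std.utf.decode(s, next);  // decodes at i, advances next
        if (d == c)
            return cast(int) i;             // byte offset of the match
        i = next;
    }
    return -1;
}

rfind is messier because stepping backwards means scanning back over 
continuation bytes to find the lead byte of the preceding code point.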

>> For the general case, UTF-32 is a pretty wasteful
>> Unicode encoding just to have that privilege?
> 
> I'm not sure there is a "general case", so it's hard to say. Some programmers
> have to deal with MBCS every day; others can go for years without ever having to
> worry about anything but vanilla ASCII.
> 
> "Wasteful" is also relative. UTF-32 is certainly wasteful of memory space, but
> UTF-8 is potentially far more wasteful of CPU cycles and memory bandwidth.
> Finding the millionth character in a UTF-8 string means looping through at least
> a million bytes, and executing some conditional logic for each one. Finding the
> millionth character in a UTF-32 string is a simple pointer offset and one-word
> fetch.
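
To put the quoted point in concrete terms, the two lookups come out 
something like this (the function names are mine, purely for 
illustration):

import std.utf;

// UTF-32: the nth character is a plain array index, constant time.
dchar nthChar32(dchar[] s, size_t n)
{
    return s[n];
}

// UTF-8: every preceding code point has to be stepped over, linear time.
dchar nthChar8(char[] s, size_t n)
{
    size_t i = 0;
    for (size_t k = 0; k < n; k++)
        i += std.utf.stride(s, i);   // width in bytes of the k-th code point
    return std.utf.decode(s, i);     // decode the code point at offset i
}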

For what it's worth, I believe the correct approach for string/array 
operations is to provide overloads for char[] and wchar[] that require 
their input to be valid UTF-8 and UTF-16, respectively.  If users know 
their data is pure ASCII, or they otherwise want to process it as a 
fixed-width string, they can cast to ubyte[] or ushort[] instead. 
This is what I'm planning for std.array in Ares.
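
As a rough sketch of how that division of labor might look (rfindByte 
is just a stand-in name, not a proposed addition):

// Illustrative sketch: a byte-wise rfind for data the caller knows is
// plain ASCII.  Casting to ubyte[] is the signal that the data should
// be treated as fixed-width rather than decoded as UTF-8.
int rfindByte(ubyte[] b, ubyte c)
{
    for (size_t i = b.length; i > 0; i--)
        if (b[i - 1] == c)
            return cast(int) (i - 1);
    return -1;
}

unittest
{
    char[] s = "hello world".dup;                  // known pure-ASCII content
    assert(rfindByte(cast(ubyte[]) s, 'o') == 7);  // reinterpret in place, no copy
}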


Sean


