Ceci n'est pas une char (was: Re: Selectable encodings)

Mike Capp mike.capp at gmail.com
Thu Apr 6 10:28:11 PDT 2006


(Changing subject line since we seem to have rudely hijacked the OP's topic)

In article <e13b56$is0$1 at digitaldaemon.com>,
Anders F Björklund says...
>
>James Dunne wrote:
>
>> The char type is really a misnomer for dealing with UTF-8 encoded 
>> strings.  It should be named closer to "code-unit for UTF-8 encoding". 

(I fully agree with this statement, by the way.)

>Yeah, but it does hold an *ASCII* character ?

I don't find that very helpful - seeing a char[] in code doesn't tell me
anything about whether it's byte-per-character ASCII or possibly-multibyte
UTF-8.
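
As a rough illustration of the ambiguity (sketched in Python rather than D, with a made-up helper name `is_one_byte_per_char`), two buffers of the same type can differ in whether byte count and character count agree, and the only way to find out is to inspect the contents:

```python
def is_one_byte_per_char(data: bytes) -> bool:
    """True if every character in the UTF-8 buffer occupies exactly one byte."""
    # Decoding is the only way to tell; the buffer's type says nothing.
    return len(data) == len(data.decode("utf-8"))


plain = "naive".encode("utf-8")          # pure ASCII
accented = "na\u00efve".encode("utf-8")  # same word with an i-diaeresis

for buf in (plain, accented):
    # Report byte length, character count, and whether the two agree.
    print(len(buf), len(buf.decode("utf-8")), is_one_byte_per_char(buf))
```

Both buffers would be plain char[] in D; nothing in the declaration distinguishes them.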

>For the general case, UTF-32 is a pretty wasteful
>Unicode encoding just to have that privilege ?

I'm not sure there is a "general case", so it's hard to say. Some programmers
have to deal with MBCS every day; others can go for years without ever having to
worry about anything but vanilla ASCII.

"Wasteful" is also relative. UTF-32 is certainly wasteful of memory space, but
UTF-8 is potentially far more wasteful of CPU cycles and memory bandwidth.
Finding the millionth character in a UTF-8 string means looping through at least
a million bytes, and executing some conditional logic for each one. Finding the
millionth character in a UTF-32 string is a simple pointer offset and one-word
fetch.
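
To make the difference concrete, here is a minimal sketch, in Python rather than D and purely for illustration. The function names `utf8_nth` and `utf32_nth` are invented for this example, and the UTF-32 buffer is assumed to be little-endian with no BOM. The UTF-8 version has to walk the bytes from the start and test each one; the UTF-32 version computes the offset directly.

```python
def utf8_nth(data: bytes, n: int) -> str:
    """Return the n-th (0-based) code point of valid UTF-8 data by linear scan."""
    if n < 0:
        raise IndexError(n)
    count = -1    # index of the code point most recently started
    start = None  # byte offset where the n-th code point begins
    for i, b in enumerate(data):
        # Bytes of the form 10xxxxxx continue a code point;
        # any other byte starts a new one.
        if (b & 0xC0) != 0x80:
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")


def utf32_nth(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-32LE data (no BOM) by direct offset."""
    if n < 0 or 4 * n + 4 > len(data):
        raise IndexError(n)
    return data[4 * n:4 * n + 4].decode("utf-32-le")


text = "a\u00e9\u20acz"  # 1-, 2-, 3- and 1-byte characters in UTF-8
assert utf8_nth(text.encode("utf-8"), 2) == utf32_nth(text.encode("utf-32-le"), 2)
```

The scan touches every byte before the target, while the UTF-32 lookup costs the same regardless of n, at the price of four bytes per character.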

At the risk of repeating James, I do think that spelling "string" as
"char[]"/"wchar[]" is grossly misleading, particularly to people coming from any
other C-family language. If I were doing any serious string-handling work in D
I'd almost certainly write an opaque String class that overloaded opIndex
(returning dchar) to do the right thing, and optimised the underlying storage to
suit the app's requirements.
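
A minimal sketch of what such a class might look like, again in Python rather than D, with `__getitem__` standing in for opIndex. The class name `OpaqueString`, the method `storage_bytes`, and the storage policy (the narrowest fixed-width unit that holds every code point) are all assumptions for illustration; a real design would choose storage to match the application.

```python
class OpaqueString:
    """Hypothetical string whose indexing always yields a whole code point."""

    def __init__(self, text: str):
        highest = max(map(ord, text), default=0)
        # Pick the narrowest fixed-width unit that fits every code point.
        if highest < 0x100:
            self._width = 1
        elif highest < 0x10000:
            self._width = 2
        else:
            self._width = 4
        self._data = b"".join(
            ord(c).to_bytes(self._width, "little") for c in text
        )

    def __len__(self) -> int:
        # Length in characters, not in bytes.
        return len(self._data) // self._width

    def __getitem__(self, i: int) -> str:
        # Constant-time lookup: the unit width is fixed per string.
        # Negative indices are not supported in this sketch.
        if not 0 <= i < len(self):
            raise IndexError(i)
        start = i * self._width
        unit = self._data[start:start + self._width]
        return chr(int.from_bytes(unit, "little"))

    def storage_bytes(self) -> int:
        """Bytes used by the underlying buffer."""
        return len(self._data)


s = OpaqueString("a\u20ac")  # letter a followed by the euro sign
print(len(s), s.storage_bytes(), s[1] == "\u20ac")
```

Callers only ever see characters; whether the buffer holds one, two, or four bytes per character is an internal detail that can change without touching client code.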

cheers
Mike





More information about the Digitalmars-d mailing list