Ceci n'est pas une char

Georg Wrede georg.wrede at nospam.org
Thu Apr 6 11:37:56 PDT 2006


Mike Capp wrote:
> (Changing subject line since we seem to have rudely hijacked the OP's
> topic)
> 
> In article <e13b56$is0$1 at digitaldaemon.com>, Anders F Björklund
> says...
> 
>> James Dunne wrote:
>> 
>>> The char type is really a misnomer for dealing with UTF-8 encoded
>>> strings.  It should be named closer to "code-unit for UTF-8
>>> encoding".
> 
> (I fully agree with this statement, by the way.)

Yes. And it's a _gross_ misnomer.

And those of us used to D can't even _begin_ to appreciate the 
[unnecessary!] extra work and effort it takes for newcomers to 
gradually come to understand it "our way".

>> Yeah, but it does hold an *ASCII* character ?
> 
> I don't find that very helpful - seeing a char[] in code doesn't tell
> me anything about whether it's byte-per-character ASCII or
> possibly-multibyte UTF-8.

(( A dumb idea: the input stream could carry a flag that gets set as 
soon as the first non-ASCII byte is found. ))
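
The check itself is a trivial byte scan. A minimal sketch (the 
allAscii helper here is hypothetical, not something in Phobos):

    bool allAscii(char[] s)
    {
        foreach (char c; s)
            if (c >= 0x80)    // high bit set: part of a multi-byte sequence
                return false;
        return true;          // safe to index byte-per-character
    }

A stream reader could keep folding that test into a flag as data 
arrives, and take the fast byte-indexed path while the flag stays clear.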

>> For the general case, UTF-32 is a pretty wasteful Unicode encoding
>> just to have that privilege ?
> 
> I'm not sure there is a "general case", so it's hard to say. Some
> programmers have to deal with MBCS every day; others can go for years
> without ever having to worry about anything but vanilla ASCII.

True!! Folks in Boise, Idaho, vs. folks in non-British Europe or the 
Far East.

> "Wasteful" is also relative. UTF-32 is certainly wasteful of memory
> space, but UTF-8 is potentially far more wasteful of CPU cycles and
> memory bandwidth.

It sure looks like it. Then again, studying the UTF-8 spec and its 
design rationale, the "why we did it this way" part (sorry, no URL 
here. Anybody?), shows that it actually is _amazingly_ light on CPU 
cycles! Really.
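
The key trick: the first byte of every UTF-8 sequence encodes the 
length of that sequence, so stepping over a character costs a couple 
of compares (or one table lookup), never a full decode. A sketch of 
the idea (Phobos' std.utf.stride does essentially this):

    size_t charWidth(char c)
    {
        if (c < 0x80) return 1;  // 0xxxxxxx: plain ASCII
        if (c < 0xC0) return 0;  // 10xxxxxx: continuation byte, not a start
        if (c < 0xE0) return 2;  // 110xxxxx: two-byte sequence
        if (c < 0xF0) return 3;  // 1110xxxx: three-byte sequence
        return 4;                // 11110xxx: four-byte sequence
    }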

(( I sure wish there was somebody in this NG who could write a 
Scientifically Valid test to compare the time needed to find the 
millionth character in UTF-8 vs. UTF-8 first converted to UTF-32. ))
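
Something like this, perhaps. Not Scientifically Valid, and written 
against today's Phobos (std.utf and std.datetime.stopwatch); the test 
string of repeated two-byte characters is my own arbitrary choice:

    import std.datetime.stopwatch : StopWatch;
    import std.stdio : writefln;
    import std.utf : stride, toUTF32;

    void main()
    {
        // A million two-byte characters, so code point index != byte index.
        char[] s;
        foreach (i; 0 .. 1_000_000)
            s ~= "ä";

        StopWatch sw;

        // Plan A: walk the UTF-8 one stride at a time: O(n).
        sw.start();
        size_t b = 0;
        foreach (n; 0 .. 999_999)
            b += stride(s, b);      // lead byte tells us how far to hop
        sw.stop();
        writefln("UTF-8 walk:      %s us (byte offset %s)",
                 sw.peek.total!"usecs", b);

        // Plan B: convert the lot to UTF-32 first, then index: O(n) + O(1).
        sw.reset();
        sw.start();
        auto d = toUTF32(s);        // allocates a 4 MB dchar array
        dchar millionth = d[999_999];
        sw.stop();
        writefln("convert + index: %s us (char '%s')",
                 sw.peek.total!"usecs", millionth);
    }

Note that plan B still touches every byte during the conversion, and 
pays a 4 MB allocation on top; it only wins if the converted string 
gets indexed many times afterwards.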

> Finding the millionth character in a UTF-8 string
> means looping through at least a million bytes, and executing some
> conditional logic for each one. Finding the millionth character in a
> UTF-32 string is a simple pointer offset and one-word fetch.

True. And even if we excluded any "character width logic" from the 
search, we would still end up with a sequential O(n) lookup vs. an 
O(1) one.

Then again, when's the last time anyone here had to find the millionth 
character of anything?  :-)

So of course this appears most relevant to library writers; for real 
world programming tasks, I think profiling would show the time wasted 
to be minor in practice.

(Ah, and of course, converting a UTF-8 input to UTF-32 and then 
shooting straight to the millionth character is way more expensive 
(both in time and space) than just looping through the UTF-8 as such. 
Not to mention the losses if one were, instead, to keep a 
million-character file on hard disk in UTF-32 (i.e. a 4 MB file) just 
to avoid the scan. The extra time spent reading in the bigger file 
would probably by itself defeat the "gain".)

> At the risk of repeating James, I do think that spelling "string" as 
> "char[]"/"wchar[]" is grossly misleading, particularly to people
> coming from any other C-family language.

No argument here. :-)
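
Even a one-line alias would remove most of the surprise for newcomers. 
A sketch, in D 1.0 style (where no such alias exists yet):

    alias char[] string;         // name the intent, not the representation

    string greeting = "hëllo";   // still UTF-8 code units underneath

(For what it's worth, D 2.0 did eventually adopt exactly this, as 
alias immutable(char)[] string.)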

In the midst of The Great Character Width Brouhaha (around November 
last year), I tried to convince Walter about this particular issue.


