Ceci n'est pas une char

Jari-Matti Mäkelä jmjmak at utu.fi.invalid
Thu Apr 6 12:47:22 PDT 2006


Georg Wrede wrote:
> (( I sure wish there was somebody in this NG who could write a
> Scientifically Valid test to compare the time needed to find the
> millionth character in UTF-8 vs. UTF-8 first converted to UTF-32. ))

It's O(n) vs. O(n). :) You have to go through all the bytes in both
cases; I'd guess the conversion path just has a higher constant factor.
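To make that concrete, here's a minimal sketch (in present-day D
syntax; nthCodePoint and nthViaUTF32 are just illustrative names, the
only library call is std.utf.toUTF32):

    import std.utf : toUTF32;

    // Reaching the n-th code point of a UTF-8 string means decoding
    // every code point before it -- an O(n) scan, no random access.
    dchar nthCodePoint(string s, size_t n)
    {
        size_t i = 0;
        foreach (dchar c; s)   // foreach decodes the UTF-8 on the fly
        {
            if (i++ == n)
                return c;
        }
        assert(0, "string has fewer than n+1 code points");
    }

    // The UTF-32 route: d[n] itself is O(1), but toUTF32 has already
    // walked every byte once and allocated 4 bytes per character.
    dchar nthViaUTF32(string s, size_t n)
    {
        auto d = toUTF32(s);
        return d[n];
    }

Both are O(n); the second just pays an extra pass and an allocation
for it.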

> So, of course for library writers, this appears as most relevant, but
> for real world programming tasks, I think after profiling, the time
> wasted may be minor, in practice.

Why not use the same encoding throughout the whole program and its
libraries? Then there's no need to convert anywhere.

> (Ah, and of course, turning a UTF-8 input into UTF-32 and then straight
> shooting the millionth character, is way more expensive (both in time
> and size) than just a loop through the UTF-8 as such. Not to mention the
> losses if one were, instead, to have a million-character file on hard
> disk in UTF-32 (i.e. a 4MB file) to avoid the look-through. Probably the
> time reading in the file gets so much longer that this in itself defeats
> the "gain".)

That's very true. A "normal" hard drive reads about 60 MB/s, so reading
a 4 MB file takes about 67 ms, while a 1 MB UTF-8 file (ASCII
characters only) is read in about 17 ms (well, I'm being a bit
optimistic here :). A modern processor executes about 3 000 000 000
operations per second, so going through the UTF-8 stream takes
1 000 000 * 10 (perhaps?) operations and thus costs about 3 ms. So
reading and then scanning the UTF-8 file is actually faster overall.
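Laying the same back-of-the-envelope figures side by side:

    read 4 MB UTF-32 file:  4 MB / (60 MB/s)               ~ 67 ms
    read 1 MB UTF-8 file:   1 MB / (60 MB/s)               ~ 17 ms
    scan it:                10^6 chars * ~10 ops / 3 GHz   ~  3 ms
    UTF-8 total:                                           ~ 20 ms

so the UTF-8 route comes out roughly three times faster here, even
though the scan itself isn't free.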

-- 
Jari-Matti
