First Impressions

Geoff Carlton gcarlton at iinet.net.au
Fri Sep 29 16:19:56 PDT 2006


Georg Wrede wrote:

> The secret is, there actually is a delicate balance between UTF-8 and 
> the library string operations. As long as you use library functions to 
> extract substrings, join or manipulate them, everything is OK. And very 
> few of us actually either need to, or see the effort of bit-twiddling 
> individual octets in these "char" arrays.
> 
> So things just keep on working.
> 

I agree with that, but I disagree that it amounts to a problem, that 
utf-8 is a bad choice, or that char[] or string should be renamed utf8.

As a note here, I actually had a page of text localised into Chinese 
last week - it came back as a utf8 text file.

The only thing with utf8 is that glyphs aren't represented by a single 
char.  But utf16 is no better - anything outside the BMP takes a 
surrogate pair.  And even utf32 codepoints can be combined into a 
single rendered glyph.  So truncating a string at an arbitrary index 
is never guaranteed to slice on a glyph boundary, in any encoding.

However, that doesn't mean utf8 is ASCII mixed with "garbage" bytes. 
The "garbage" is a unique series of bytes representing exactly one 
codepoint, and no byte of it ever falls in the ASCII range.  Legacy 
multi-byte encodings such as Shift-JIS lack this property: there a 
trail byte can look like an ASCII character.

As such, everything byte-oriented works - strstr, strchr, strcat, 
printf, scanf - for ASCII, normal unicode, and the "astral planes". 
It all just works.  The only thing that breaks is if you try to index 
or truncate the data by hand.

But even that mostly works: you can iterate through looking for ASCII 
sequences, chop out the ASCII and string together more stuff, and it 
all works because you can simply skip over the high-order bytes. 
Pretty much the only thing that fails is saying "I don't know what's 
in the string, but chop it off at index 12".
