First Impressions

Georg Wrede georg.wrede at nospam.org
Fri Sep 29 17:40:17 PDT 2006


Geoff Carlton wrote:
> Georg Wrede wrote:
> 
>> The secret is, there actually is a delicate balance between UTF-8 and 
>> the library string operations. As long as you use library functions to 
>> extract substrings, join or manipulate them, everything is OK. And 
>> very few of us actually need to bit-twiddle individual octets in 
>> these "char" arrays, or care to go to the effort.
>>
>> So things just keep on working.
>>
> 
> I agree, but I disagree that there is a problem, or that utf-8 is a bad 
> choice, or that perhaps char[] or string should be called utf8 instead.
> 
> As a note here, I actually had a page of text localised into Chinese 
> last week - it came back as a utf8 text file.
> 
> The only thing with utf8 is that a glyph isn't represented by a single 
> char.  But utf16 is no better!  And even utf32 codepoints can be 
> combined into a single rendered glyph.  So truncating a string at an 
> arbitrary index isn't guaranteed to land on a glyph boundary.
> 
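
A rough C sketch of that point (the string is just an example; strlen
counts code units, the loop counts code points, and neither is a glyph
count once combining marks enter the picture):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = "na\xC3\xAFve";   /* "naive" with i-diaeresis: bytes C3 AF */

    printf("%zu\n", strlen(s));       /* 6: code units (bytes)                 */

    size_t codepoints = 0;
    for (const char *p = s; *p; p++)
        if (((unsigned char)*p & 0xC0) != 0x80)   /* skip continuation bytes   */
            codepoints++;
    printf("%zu\n", codepoints);      /* 5: code points                        */

    /* Cutting at byte index 3 would split the C3 AF pair, so chopping at
       an arbitrary index is never safe. */
    return 0;
}
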
> However, it doesn't mean utf8 is ASCII mixed with "garbage" bytes.  That 
> garbage is a unique series of bytes that represents a codepoint, a 
> property most other multi-byte encodings don't share.
> 
> As such, everything works: strstr, strchr, strcat, printf, scanf - for 
> ASCII, normal unicode, and the "Astral planes".  It all just works.  The 
> only thing that breaks is if you try to index or truncate the data by 
> hand.
> 
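
Exactly, and this is also why the earlier point about sticking to the
library routines holds: they treat the string as opaque bytes.  A quick
C sketch (the Japanese text is just an example, written as hex escapes
so the source encoding doesn't matter):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "nihongo version 2": three kanji followed by ASCII */
    const char *s = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E version 2";

    printf("%s\n", strstr(s, "version"));       /* finds the ASCII word       */
    printf("%s\n", strchr(s, '2'));             /* finds the ASCII char       */
    printf("%s\n", strstr(s, "\xE6\x9C\xAC"));  /* finds one kanji's byte run */

    char buf[64];
    strcpy(buf, s);
    strcat(buf, " (ja)");                       /* byte-wise concatenation    */
    printf("%s\n", buf);                        /* still valid UTF-8          */
    return 0;
}

Lead bytes and continuation bytes come from disjoint ranges, and no
byte of a multi-byte sequence is below 0x80, so an ASCII needle can
never produce a false match inside one.
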
> But even that mostly works: you can iterate through looking for ASCII 
> sequences, chop out ASCII, and string together more stuff, because you 
> can simply skip over the bytes with the high bit set.  Pretty much the 
> only thing that fails is saying "I don't know what's in the string, 
> but chop it off at index 12".

Yes.
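
And the "chop it off at index 12" case is easy to repair: back up until
you are no longer on a continuation byte.  A rough sketch in C (the
helper name is my own invention; the string is just an example, saved
as UTF-8):

#include <stdio.h>
#include <string.h>

/* Largest cut point <= n that lands on a UTF-8 sequence boundary.
   n must not exceed strlen(s). */
static size_t utf8_chop(const char *s, size_t n)
{
    while (n > 0 && ((unsigned char)s[n] & 0xC0) == 0x80)
        n--;                        /* step back over continuation bytes */
    return n;
}

int main(void)
{
    char s[] = "Köln, Grüße!";      /* the umlauts and the sharp s take two bytes each */
    size_t cut = utf8_chop(s, 12);  /* 12 points into the middle of the sharp s */
    s[cut] = '\0';
    printf("%s\n", s);              /* "Köln, Grü": still valid UTF-8 */
    return 0;
}

Which is really just the point above: leave the slicing to something
that knows where the sequence boundaries are, and char[] keeps working.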


