First Impressions
Georg Wrede
georg.wrede at nospam.org
Fri Sep 29 17:40:17 PDT 2006
Geoff Carlton wrote:
> Georg Wrede wrote:
>
>> The secret is, there actually is a delicate balance between UTF-8 and
>> the library string operations. As long as you use library functions to
>> extract substrings, join or manipulate them, everything is OK. And
>> very few of us actually either need to, or want to go to the effort of,
>> bit-twiddling individual octets in these "char" arrays.
>>
>> So things just keep on working.
>>
>
> I agree, but I disagree that there is a problem, that utf-8 is a bad
> choice, or that char[] or string should be called utf8 instead.
>
> As a note here, I actually had a page of text localised into Chinese
> last week - it came back as a utf8 text file.
>
> The only thing with utf8 is that glyphs aren't represented by a single
> char. But utf16 is no better! And even utf32 codepoints can be
> combined into a single rendered glyph. So truncating a string at an
> arbitrary index is not going to slice on a glyph boundary.
>
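To put numbers on that, here is a quick sketch (D1-era syntax, the string is
just an arbitrary example): .length counts UTF-8 code units, i.e. bytes,
while a foreach over dchar decodes whole codepoints:

import std.stdio;

void main()
{
    // "häuschen" contains one two-byte UTF-8 sequence (the 'ä')
    char[] s = "häuschen";

    // .length counts code units (bytes), not codepoints or glyphs
    writefln("bytes: %d", s.length);        // prints 9

    // foreach with a dchar loop variable decodes codepoints on the fly
    size_t codepoints = 0;
    foreach (dchar c; s)
        ++codepoints;
    writefln("codepoints: %d", codepoints); // prints 8
}
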
> However, it doesn't mean utf8 is ASCII mixed with "garbage" bytes. That
> garbage is a unique series of bytes that represents a codepoint. This is
> a property not found in any other encoding.
>
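Right, that is UTF-8's self-synchronization: every byte announces its own
role, so a byte-wise scan can never mistake the tail of one character for
the start of another. A small illustration (again D1-era syntax, arbitrary
string):

import std.stdio;

void main()
{
    char[] s = "a€b";   // '€' is the three-byte sequence E2 82 AC

    foreach (size_t i, char c; s)
    {
        if (c < 0x80)
            writefln("%d: 0x%02x  ASCII", i, cast(ubyte) c);
        else if ((c & 0xC0) == 0x80)
            writefln("%d: 0x%02x  continuation byte", i, cast(ubyte) c);
        else
            writefln("%d: 0x%02x  lead byte of a multi-byte sequence", i, cast(ubyte) c);
    }
}
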
> As such, everything works: strstr, strchr, strcat, printf, scanf - for
> ASCII, normal unicode, and the "Astral planes". It all just works. The
> only thing that breaks is indexing or truncating the data by hand at an
> arbitrary byte offset.
>
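For instance, a strstr-style byte-wise search that knows nothing about UTF-8
still can't produce a false match inside some other character. Rough sketch
(byteFind is just a hand-rolled helper for illustration, not a library call):

import std.stdio;

// Plain byte-wise substring search.  It never decodes anything, yet on
// valid UTF-8 a multi-byte needle can only match at a real codepoint
// boundary, never in the middle of another character.
int byteFind(char[] haystack, char[] needle)
{
    if (needle.length == 0 || needle.length > haystack.length)
        return -1;
    for (size_t i = 0; i + needle.length <= haystack.length; i++)
        if (haystack[i .. i + needle.length] == needle)
            return cast(int) i;
    return -1;
}

void main()
{
    char[] text = "price: 10€ only";   // '€' is three bytes in UTF-8
    writefln("found at byte offset %d", byteFind(text, "€"));   // prints 9
}
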
> But even that mostly works: you can iterate through looking for ASCII
> sequences, chop out the ASCII, and string together more stuff, because
> you can simply skip over the bytes with the high bit set. Pretty much the
> only thing that fails is saying "I don't know what's in the string, but
> chop it off at index 12".
Yes.
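And the fix for that one failure mode is cheap: back up over continuation
bytes (10xxxxxx) until you land on a lead byte or an ASCII byte, then cut
there. Rough sketch, D1-era syntax; toBoundary is just an illustrative
helper:

import std.stdio;
import std.utf;     // validate() throws on malformed UTF-8

// Move a byte index back to the nearest preceding codepoint boundary by
// skipping over continuation bytes (those of the form 10xxxxxx).
size_t toBoundary(char[] s, size_t i)
{
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        i--;
    return i;
}

void main()
{
    char[] s = "brûlée crème";

    // A blind s[0 .. 12] would cut the two-byte 'è' in half.
    char[] safe = s[0 .. toBoundary(s, 12)];   // backs up to byte 11
    validate(safe);                            // passes: we cut on a boundary
    writefln(safe);                            // prints "brûlée cr"
}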