First Impressions
Geoff Carlton
gcarlton at iinet.net.au
Fri Sep 29 16:19:56 PDT 2006
Georg Wrede wrote:
> The secret is, there actually is a delicate balance between UTF-8 and
> the library string operations. As long as you use library functions to
> extract substrings, join or manipulate them, everything is OK. And very
> few of us actually either need to, or see the effort of bit-twiddling
> individual octets in these "char" arrays.
>
> So things just keep on working.
>
I agree with that, but I disagree that there is a problem, that UTF-8 is a
bad choice, or that char[] or string should be renamed utf8 instead.
As a side note, I actually had a page of text localised into Chinese
last week - it came back as a UTF-8 text file.
The only catch with UTF-8 is that a glyph isn't always represented by a
single char.  But UTF-16 is no better!  And even UTF-32 codepoints can be
combined into a single rendered glyph.  So truncating a string at an
arbitrary index is not guaranteed to slice on a glyph boundary.
However, that doesn't mean UTF-8 is ASCII mixed with "garbage" bytes.  Each
multibyte sequence is a unique series of bytes representing exactly one
codepoint: it can never occur inside the encoding of another codepoint, and
its bytes never collide with ASCII.  This self-synchronising property is one
that older multibyte encodings such as Shift-JIS lack.
As such, everything byte-oriented just works - strstr, strchr, strcat,
printf, scanf - for ASCII, ordinary Unicode, and the "astral planes"
alike.  The only thing that breaks is indexing or truncating the data
by hand.
But even that mostly works: you can iterate through the bytes looking for
ASCII sequences, chop out ASCII, and splice in more text, because the
continuation bytes can simply be ignored.  Pretty much the only thing that
fails is saying "I don't know what's in the string, but chop it off at
byte index 12".