To Walter, about char[] initialization by FF
Bruno Medeiros
brunodomedeirosATgmail at SPAM.com
Sun Jul 30 08:52:28 PDT 2006
Unknown W. Brackets wrote:
> 6. The FF byte (8-bit octet sequence) may never appear in any valid
> UTF-8 string. Since char can only contain UTF-8 strings, it represents
> invalid data if it contains such an 8-bit octet.
>
You mentioned "8-bit octet" repeatedly in various posts. That's
redundant: An "octet" is an 8-bit value. There are no "16-bit octets"
and no "8-bit hextets" or stuff like that :P . I hope you knew that and
were just distracted, but you kept saying that :) .
> 1. UTF-8 character here could mean an 8-bit octet of code point. In
> this case, they are both the same and represent a perfectly valid
> character in a string.
>
A "UTF-8 octet" is also called a UTF-8 'code unit'. Similarly, a "UTF-16
hextet" is called a UTF-16 'code unit'. A single UTF-8 code unit holds a
whole Unicode code point only if the code point is < 128. Otherwise
multiple UTF-8 code units (two to four) are needed to encode that code point.
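
To make the code unit count concrete, here is a minimal sketch in Python (chosen only for illustration; the helper name utf8_code_units is made up for this example and is not from any library). It relies on Python's built-in UTF-8 codec to list the code units of a few sample code points:

```python
def utf8_code_units(code_point: int) -> bytes:
    """Return the UTF-8 code units (bytes) that encode one code point.

    Hypothetical helper for illustration; it just wraps Python's
    built-in UTF-8 encoder.
    """
    return chr(code_point).encode("utf-8")


# Sample code points on both sides of the 128 boundary.
samples = [0x41, 0x7F, 0x80, 0xE9, 0x20AC, 0x1F600]

for cp in samples:
    units = utf8_code_units(cp)
    hex_units = " ".join(f"{b:02X}" for b in units)
    print(f"U+{cp:04X}: {len(units)} code unit(s): {hex_units}")
```

U+0041 and U+007F each fit in one code unit, while U+0080 already needs two (C2 80) and U+1F600 needs four.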
The confusion between 'code unit' and 'code point' is a long-standing
one. A "UTF-8 character" is a slightly ambiguous term. Does it mean a
UTF-8 code unit, or does it mean a Unicode character/code point encoded
as a UTF-8 sequence?
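
The value FF from this thread shows why the distinction matters. Below is a rough sketch in Python (used only for illustration; is_valid_utf8 is a name invented for this example). It contrasts the code point U+00FF with a lone code unit 0xFF, assuming Python's strict built-in UTF-8 decoder as the validity check:

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the byte sequence decodes as strict UTF-8.

    Hypothetical helper for illustration; validity is delegated to
    Python's built-in decoder.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


code_point_ff = "\u00ff"                 # the code point U+00FF
encoded = code_point_ff.encode("utf-8")  # its UTF-8 code units
lone_code_unit_ff = bytes([0xFF])        # a single code unit 0xFF

print(len(code_point_ff), len(encoded))
print(is_valid_utf8(encoded))
print(is_valid_utf8(lone_code_unit_ff))
```

The code point U+00FF is a perfectly valid character and encodes as the two code units C3 BF, whereas the single code unit FF is rejected as invalid UTF-8.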
--
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D