To Walter, about char[] initialization by FF

Bruno Medeiros brunodomedeirosATgmail at SPAM.com
Sun Jul 30 08:52:28 PDT 2006


Unknown W. Brackets wrote:
> 6. The FF byte (8-bit octet sequence) may never appear in any valid 
> UTF-8 string.  Since char can only contain UTF-8 strings, it represents 
> invalid data if it contains such an 8-bit octet.
> 
You mentioned "8-bit octet" repeatedly in various posts. That's 
redundant: an "octet" is an 8-bit value. There are no "16-bit octets" 
and no "8-bit hextets" or anything like that :P . I hope you knew that 
and were just distracted, but you kept saying it :) .
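
As an aside, the quoted point 6 (that FF can never appear in valid UTF-8) 
is easy to check. Below is a minimal sketch in Python, used here purely 
for illustration; the helper name is_valid_utf8 is made up for this 
example and is not part of any library.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # error handling defaults to "strict"
        return True
    except UnicodeDecodeError:
        return False

# 0xFF is neither a legal lead byte nor a continuation byte in UTF-8,
# so any sequence containing it is rejected.
print(is_valid_utf8(b"\xff"))        # a lone FF octet
print(is_valid_utf8(b"abc\xffdef"))  # FF embedded in otherwise valid ASCII
print(is_valid_utf8(b"\xc3\xa9"))    # the two code units that encode U+00E9
```

The first two calls should report the data as invalid and the last one 
as valid, which is why FF works as an "uninitialized" marker for char.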

> 1. UTF-8 character here could mean an 8-bit octet of code point.  In 
> this case, they are both the same and represent a perfectly valid 
> character in a string.
> 

A "UTF-8 octet" is also called a UTF-8 'code unit'. Similarly, a "UTF-16 
hextet" is called a UTF-16 'code unit'. A single UTF-8 code unit holds a 
Unicode code point if the code point is <128. Otherwise, multiple UTF-8 
code units are needed to encode that code point.
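
To make the code unit / code point distinction concrete, here is a small 
sketch, again in Python just for illustration. The function name 
utf8_code_units is hypothetical, and the three sample characters are my 
own choice.

```python
def utf8_code_units(ch: str) -> list:
    """Return the UTF-8 code units (octets) that encode one code point."""
    return list(ch.encode("utf-8"))

# One code point below 128, one needing two code units, one needing three.
for ch in ["A", "\u00e9", "\u20ac"]:
    units = utf8_code_units(ch)
    print(f"U+{ord(ch):04X} -> {len(units)} code unit(s):",
          [hex(u) for u in units])
```

U+0041 fits in a single code unit (0x41), while U+00E9 takes two 
(0xc3 0xa9) and U+20AC takes three (0xe2 0x82 0xac).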

The confusion between 'code unit' and 'code point' is a long-standing 
one. A "UTF-8 character" is a slightly ambiguous term. Does it mean a 
UTF-8 code unit, or does it mean a Unicode character/code point encoded 
in a UTF-8 sequence?

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D


