To Walter, about char[] initialization by FF

Sun Jul 30 10:09:26 PDT 2006

I use that terminology because I've read too many RFCs (consider the FTP 
RFC) - they all say "8-bit octet".  Anyway, I'm trying to be completely 
clear.

Code unit.  Yeah, I knew it was code something but it slipped my mind.
I was sure that he'd either correct me or that "8-bit octet" and the
like would remain clear.  I hate it when I forget such obvious terms.

Anyway, my point in what you're quoting is very context-dependent.
Walter mentioned that "0 is a valid UTF-8 character."  Andrew asked what
this meant, so I explained that in this case (as you also clarified) it
doesn't make any difference whether "character" means a code unit or a
code point.  Either way, 0 is a valid [whatever it is], and that meaning
is not unclear.
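
For what it's worth, here is a rough way to check the 0-versus-FF point.
It is only a sketch in Python (not D, and not anything from the thread),
and the helper name is_valid_utf8 is made up for illustration.  It leans
on Python's strict UTF-8 decoder to decide what counts as valid.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A single 0 byte is one code unit and also encodes the code point U+0000.
print(is_valid_utf8(b"\x00"))        # True

# The byte FF is rejected on its own and anywhere inside a string.
print(is_valid_utf8(b"\xff"))        # False
print(is_valid_utf8(b"abc\xffdef"))  # False
```

So a 0 byte decodes fine whichever way you read "character", while FF
is rejected wherever it shows up.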

-[Unknown]

> Unknown W. Brackets wrote:
>> 6. The FF byte (8-bit octet sequence) may never appear in any valid 
>> UTF-8 string.  Since char can only contain UTF-8 strings, it 
>> represents invalid data if it contains such an 8-bit octet.
>>
> You mentioned "8-bit octet" repeatedly in various posts. That's 
> redundant: An "octet" is an 8-bit value. There are no "16-bit octets" 
> and no "8-bit hextets" or stuff like that :P . I hope you knew that and 
> were just distracted, but you kept saying that :) .
> 
>> 1. UTF-8 character here could mean an 8-bit octet or code point.  In 
>> this case, they are both the same and represent a perfectly valid 
>> character in a string.
>>
> 
> A "UTF-8 octet" is also called a UTF-8 'code unit'. Similarly a "UTF-16 
> hextet" is called a UTF-16 'code unit'. A UTF-8 code unit holds a 
> Unicode code point if the code point is <128. Otherwise multiple UTF-8 
> code units are needed to encode that code point.
> 
> The confusion between 'code unit' and 'code point' is a long-standing 
> one. A "UTF-8 character" is a slightly ambiguous term. Does it mean a 
> UTF-8 code unit, or does it mean a Unicode character/codepoint encoded 
> in a UTF-8 sequence?
>
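
To make the code unit / code point distinction quoted above concrete,
here is a small sketch.  It is in Python purely for illustration, and
the helper name code_units is an assumption, not something from the
thread.  It lists the UTF-8 code units for a few code points on either
side of the 128 boundary.

```python
def code_units(text: str) -> list:
    """Return the UTF-8 code units (byte values) that encode text."""
    return list(text.encode("utf-8"))


# U+0041 is below 128, so one code unit holds the code point directly.
# U+00E9 and U+20AC are not, so each needs several code units.
for ch in ["A", "\u00e9", "\u20ac"]:
    units = code_units(ch)
    shown = " ".join(f"{u:02X}" for u in units)
    print(f"U+{ord(ch):04X}: {len(units)} code unit(s): {shown}")
```

Each entry in the list is one code point, but only the first fits in a
single code unit, which is why "UTF-8 character" can be read two ways.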