To Walter, about char[] initialization by FF
Unknown W. Brackets
unknown at simplemachines.org
Sun Jul 30 10:09:26 PDT 2006
I use that terminology because I've read too many RFCs (consider the FTP
RFC) - they all say "8-bit octet". Anyway, I'm trying to be completely
clear.
Code unit. Yeah, I knew it was code something but it slipped my mind.
I was sure that he'd either correct me or 8-bit octet/etc. would remain
clear. I hate it when I forget such obvious terms.
Anyway, my point in what you're quoting is very context-dependent.
Walter mentioned that "0 is a valid UTF-8 character." Andrew asked what
this meant, so I explained that in this case (as you also clarified) it
doesn't make any difference. Regardless, it's a valid [whatever it is]
and that meaning is not unclear.
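To make the 0-versus-FF point concrete, here is a minimal sketch in Python (not D, and not anything from the compiler). The helper name is_valid_utf8 is made up for this illustration; it just asks Python's strict UTF-8 decoder whether a byte sequence is acceptable.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The 0 byte is a complete, valid one-unit sequence (code point U+0000).
print(is_valid_utf8(b"\x00"))
# The FF byte cannot appear anywhere in a valid UTF-8 sequence.
print(is_valid_utf8(b"\xff"))
print(is_valid_utf8(b"abc\xffdef"))
```

Under these assumptions the first check succeeds and the other two fail, which is why FF works as an "obviously uninitialized" marker while 0 does not.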
-[Unknown]
> Unknown W. Brackets wrote:
>> 6. The FF byte (8-bit octet sequence) may never appear in any valid
>> UTF-8 string. Since char can only contain UTF-8 strings, it
>> represents invalid data if it contains such an 8-bit octet.
>>
> You mentioned "8-bit octet" repeatedly in various posts. That's
> redundant: An "octet" is an 8-bit value. There are no "16-bit octets"
> and no "8-bit hextets" or stuff like that :P . I hope you knew that and
> were just distracted, but you kept saying that :) .
>
>> 1. UTF-8 character here could mean an 8-bit octet or code point. In
>> this case, they are both the same and represent a perfectly valid
>> character in a string.
>>
>
> A "UTF-8 octet" is also called a UTF-8 'code unit'. Similarly a "UTF-16
> hextet" is called a UTF-16 'code unit'. A UTF-8 code unit holds a
> Unicode code point if the code point is <128. Otherwise multiple UTF-8
> code units are needed to encode that code point.
>
> The confusion between 'code unit' and 'code point' is a long-standing
> one. A "UTF-8 character" is a slightly ambiguous term. Does it mean a
> UTF-8 code unit, or does it mean a Unicode character/codepoint encoded
> in a UTF-8 sequence?
>
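The code unit versus code point distinction quoted above can be sketched as follows. This is an illustrative Python snippet, not part of the original exchange, and utf8_code_units is a hypothetical helper name. It lists the UTF-8 code units (octets) that encode a single code point.

```python
def utf8_code_units(text: str) -> list:
    """Return the UTF-8 code units (as ints) that encode text."""
    return list(text.encode("utf-8"))

# One code point each: U+0041, U+00E9, U+20AC.
for ch in ["A", "\u00e9", "\u20ac"]:
    units = utf8_code_units(ch)
    # Code points below 128 fit in one code unit; the others need several.
    print(f"U+{ord(ch):04X}: {len(units)} code unit(s) {[hex(u) for u in units]}")
```

Only the first of the three code points is below 128, so only it is held by a single code unit; the other two take two and three code units respectively.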