To Walter, about char[] initialization by FF
Frits van Bommel
fvbommel at REMwOVExCAPSs.nl
Sat Jul 29 14:19:09 PDT 2006
Andrew Fedoniouk wrote:
> To Walter:
>
> Following assumption (
> http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):
>
> "codepoint U+FFFF is not a legitimate Unicode character, and, furthermore,
> it is guaranteed by the Unicode Consortium that 0xFFFF will NEVER be a
> legitimate Unicode character. This codepoint will remain forever
> unassigned, precisely so that it may be used for purposes such as this."
>
> is just wrong.
>
> 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
> R-zone: {U+FFF0..U+FFFF} - region assigned already.
Yep, 0xFFFF is in the "Specials" range. In fact, together with 0xFFFE it
forms the subrange of the "Noncharacters" (see
http://www.unicode.org/charts/PDF/UFFF0.pdf, at the end). These are
"intended for process internal uses, but are not permitted for
interchange". 0xFFFF specifically is marked "<not a character> - the
value FFFF is guaranteed not to be a Unicode character at all".
So yes, it's assigned - for exactly such a purpose as D is using it for :).
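For anyone who wants to check a value against the noncharacter set, here is a
minimal sketch in Python (not D, purely for illustration). The helper name
is_noncharacter is made up for this example. It assumes the usual definition
of the 66 noncharacters: U+FDD0..U+FDEF plus the last two code points of
every plane.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters.

    Hypothetical helper for illustration only.
    """
    if not 0 <= cp <= 0x10FFFF:
        return False  # outside the Unicode code space altogether
    if 0xFDD0 <= cp <= 0xFDEF:
        return True   # the contiguous block of 32 noncharacters
    # Last two code points of each plane: U+xxFFFE and U+xxFFFF.
    return (cp & 0xFFFE) == 0xFFFE


print(is_noncharacter(0xFFFF))  # the value D uses to initialize wchar
print(is_noncharacter(0xFFFD))  # REPLACEMENT CHARACTER, an ordinary character
```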
> 2) For char[] selection of 0xFF is wrong and even worse.
> For example character with code 0xFF in Latin-I encoding is
> "y diaeresis". In many European languages and Far East encodings 0xFF is a
> valid code point.
> For example in KOI-8 encoding 0xFF is officially assigned value.
First of all, non-Unicode encodings are irrelevant. 'char' is a UTF-8
code unit (I think that's the correct term).
It's not a Unicode character (though some Unicode characters are encoded
as a single UTF-8 code unit, specifically anything below 0x80 IIRC).
U+00FF is indeed a valid Unicode character, but that doesn't mean that
character is encoded as a byte with value 0xFF in UTF-8 (which char[]s
represent); in UTF-8 it becomes the two bytes 0xC3 0xBF. 0xFF is in fact
one of the byte values that *cannot* occur in a valid UTF-8 text.
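
A quick way to convince yourself of this, sketched in Python rather than D
just because its built-in codec makes the check short. The helper name
is_valid_utf8 is made up for this example; it only wraps the standard
bytes.decode call.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+00FF (y with diaeresis) is a perfectly good character...
encoded = "\u00ff".encode("utf-8")
print(encoded.hex())             # its UTF-8 form, two bytes, neither is 0xFF

# ...but a lone 0xFF byte is never valid UTF-8.
print(is_valid_utf8(encoded))    # the encoded character
print(is_valid_utf8(b"\xff"))    # the raw byte used as the char initializer

# Characters below 0x80 are the ones that fit in a single code unit.
print(len("A".encode("utf-8")), len("\u0080".encode("utf-8")))
```

So a char[] filled with 0xFF can never be mistaken for real text, whatever
characters that text contains.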