To Walter, about char[] initialization by FF

Sat Jul 29 14:19:09 PDT 2006

Andrew Fedoniouk wrote:
> To Walter:
> 
> Following assumption ( 
> http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):
> 
> "codepoint U+FFFF is not a legitimate Unicode character, and, furthermore, 
> it is guaranteed by the
> Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode character.
> This codepoint will remain forever unassigned, precisely so that it may be 
> used
> for purposes such as this."
> 
> is just wrong.
> 
> 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
> R-zone: {U+FFF0..U+FFFF} - region assigned already.

Yep, 0xFFFF is in the "Specials" range. In fact, together with 0xFFFE it 
forms the subrange of the "Noncharacters" (see 
http://www.unicode.org/charts/PDF/UFFF0.pdf, at the end). These are 
"intended for process internal uses, but are not permitted for 
interchange". 0xFFFF specifically is marked "<not a character> - the 
value FFFF is guaranteed not to be a Unicode character at all".
So yes, it's assigned - for exactly such a purpose as D is using it for :).
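
For anyone who wants to check this without opening the PDF, here is a
small sketch in Python (not D, just because it ships with the Unicode
database). The helper name is_noncharacter is my own invention; the
ranges it tests are the 66 noncharacters: U+FDD0..U+FDEF plus the last
two code points of each of the 17 planes.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters.

    Hypothetical helper for illustration: U+FDD0..U+FDEF, plus every
    code point whose low 16 bits are FFFE or FFFF.
    """
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Collect every noncharacter in the whole code space U+0000..U+10FFFF.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]

print(len(noncharacters))                                 # 66
print(is_noncharacter(0xFFFE), is_noncharacter(0xFFFF))   # True True
print(unicodedata.category("\uffff"))                     # Cn (unassigned)
```

So U+FFFF sits in an allocated block, yet its general category is still
"Cn": it is reserved as a noncharacter rather than assigned to a
character.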

> 2) For char[] selection of 0xFF is wrong and even worse.
> For example character with code 0xFF in Latin-I encoding is
> "y diaeresis". In many European languages and Far East encodings 0xFF is a 
> valid code point.
> For example in KOI-8 encoding 0xFF is officially assigned value.

First of all, non-Unicode encodings are irrelevant. A 'char' is a UTF-8 
code unit, i.e. one byte of a UTF-8 sequence.
It's not a Unicode character (though some Unicode characters are encoded 
as a single UTF-8 code unit, specifically everything below 0x80, i.e. 
U+0000 to U+007F).
U+00FF is indeed a valid Unicode character, but that doesn't mean that 
character is encoded as a byte with value 0xFF in UTF-8 (which char[]s 
represent); it is encoded as the two bytes 0xC3 0xBF. 0xFF is in fact 
one of the byte values that *cannot* occur in a valid UTF-8 text.
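
The point is easy to verify by brute force. The following is a rough
sketch in Python (again not D); the function name utf8_bytes_in_use is
made up for this example. It encodes every Unicode scalar value as
UTF-8 and records which byte values ever show up.

```python
def utf8_bytes_in_use() -> set:
    """Return the set of byte values that appear in any UTF-8 encoding.

    Hypothetical helper for illustration. Surrogates (U+D800..U+DFFF)
    are skipped because they are not encodable in UTF-8.
    """
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue
        used.update(chr(cp).encode("utf-8"))
    return used


# U+00FF (y with diaeresis) is one byte in Latin-1 but two in UTF-8.
print("\u00ff".encode("latin-1"))   # b'\xff'
print("\u00ff".encode("utf-8"))     # b'\xc3\xbf'

# Byte values that never occur anywhere in valid UTF-8.
never_used = sorted(set(range(256)) - utf8_bytes_in_use())
print(0xFF in never_used)           # True

# A lone 0xFF byte is therefore rejected by a UTF-8 decoder.
try:
    b"\xff".decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)                   # False
```

The unused values come out as 0xC0, 0xC1 and 0xF5 through 0xFF, so a
char initialized to 0xFF can never be mistaken for part of valid text.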