To Walter, about char[] initialization by FF

Sat Jul 29 15:21:23 PDT 2006

"Frits van Bommel" <fvbommel at REMwOVExCAPSs.nl> wrote in message 
news:eagjcd$1m1t$1 at digitaldaemon.com...
> Andrew Fedoniouk wrote:
>> To Walter:
>>
>> Following assumption ( 
>> http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):
>>
>> "codepoint U+FFFF is not a legitimate Unicode character, and, 
>> furthermore, it is guaranteed by the
>> Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode 
>> character.
>> This codepoint will remain forever unassigned, precisely so that it may 
>> be used
>> for purposes such as this."
>>
>> is just wrong.
>>
>> 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
>> R-zone: {U+FFF0..U+FFFF} - region assigned already.
>
> Yep, 0xFFFF is in the "Specials" range. In fact, together with 0xFFFE it 
> forms the subrange of the "Noncharacters" (see 
> http://www.unicode.org/charts/PDF/UFFF0.pdf, at the end). These are 
> "intended for process internal uses, but are not permitted for 
> interchange". 0xFFFF specifically is marked "<not a character> - the value 
> FFFF is guaranteed not to be a Unicode character at all".
> So yes, it's assigned - for exactly such a purpose as D is using it for 
> :).
>
>> 2) For char[] selection of 0xFF is wrong and even worse.
>> For example character with code 0xFF in Latin-1 encoding is
>> "y diaeresis". In many European languages and Far East encodings 0xFF is 
>> a valid code point.
>> For example in KOI-8 encoding 0xFF is officially assigned value.
>
> First of all, non-Unicode encodings are irrelevant. 'char' is a UTF-8 
> codepoint (I think that's the correct term).

Sorry, but this is wrong. "UTF-8 codepoint" is nonsense.

In common practice a code point is: (1) a numerical index (or position)
in an encoding table used for encoding characters;
(2) a synonym for Unicode scalar value.

As a rule, one code point is represented by a single glyph when it is
displayed to a human.
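
A minimal sketch of the distinction, in Python rather than D and purely
for illustration (the variable names are assumptions, not anything from
D or Phobos). It takes one code point, U+00FF, and shows the two bytes
that encode it in UTF-8:

```python
# One code point versus the bytes (code units) that encode it in UTF-8.
ch = "\u00ff"                    # LATIN SMALL LETTER Y WITH DIAERESIS
code_point = ord(ch)             # the code point itself: the number 0xFF
utf8_bytes = ch.encode("utf-8")  # its UTF-8 encoding: a two-byte sequence

print(hex(code_point))                # 0xff
print([hex(b) for b in utf8_bytes])   # ['0xc3', '0xbf']
print(len(ch), len(utf8_bytes))       # 1 2
```

The code point is a single number; the UTF-8 form is a sequence of
bytes, none of which is a code point on its own.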

> It's not a Unicode character (though some Unicode characters are encoded 
> as a single UTF-8 codepoint, specifically anything up to 0x80 IIRC).
> 0xFF is indeed a valid Unicode character, but that doesn't mean that 
> character is encoded as a byte with value 0xFF in UTF-8 (which char[]s 
> represent). 0xFF is in fact one of the byte values that *cannot* occur in 
> a valid UTF-8 text.

Sorry, but an element of a UTF-8 encoded sequence is a byte (octet), not
a char. char as a type historically means a type for storing character
code points. 0xFF is an assigned and legal value in many encodings.
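
A small sketch of what a byte with value 0xFF means under different
encodings, again in Python purely for illustration. The choice of
KOI8-R as the concrete KOI-8 variant and all variable names are
assumptions made for this example:

```python
# Interpret the single byte 0xFF under three different encodings.
raw = b"\xff"

latin1 = raw.decode("latin-1")   # U+00FF, y with diaeresis
koi8r = raw.decode("koi8_r")     # U+042A, Cyrillic capital hard sign

# In UTF-8 the byte 0xFF never appears, so decoding must fail.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(hex(ord(latin1)), hex(ord(koi8r)), valid_utf8)
```

So the same byte is an ordinary letter in two single-byte encodings and
an invalid byte in UTF-8; which reading applies depends entirely on what
the type is declared to hold.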

Either use a different name for this "D char" - let's say utf8byte - or
use char in the sense of "code point value" and therefore initialize it
with the NUL value, which is common to all known encodings.
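
As a rough check of the NUL point, the sketch below decodes the byte
0x00 under a handful of byte-oriented encodings. The list is an
arbitrary sample chosen for this example, not a proof for every
encoding, and the names are hypothetical:

```python
# Decode the byte 0x00 under several byte-oriented encodings.
encodings = ["ascii", "latin-1", "koi8_r", "cp1251", "shift_jis", "utf-8"]

nul_results = {name: b"\x00".decode(name) for name in encodings}

# True if every sampled encoding maps 0x00 to U+0000 (NUL).
all_nul = all(text == "\x00" for text in nul_results.values())

print(all_nul)
```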

Andrew Fedoniouk.
http://terrainformatica.com