To Walter, about char[] initialization by FF

Unknown W. Brackets unknown at simplemachines.org
Sat Jul 29 15:23:11 PDT 2006


Andrew,

I think it will make a lot more sense if you keep these things in 
mind... (I'm sure you already know all of them, I'm just listing them 
out since they're crucial and must be thought of together):

1. char, wchar, and dchar are separate types.

2. char contains UTF-8 bytes.  It may not contain UTF-16, UCS-2, KOI-8R, 
or any other encoding.  It must contain UTF-8.

3. wchar contains UTF-16.  It is similar to char in every other way (may 
not contain any other encoding than UTF-16, not even UCS-2.)

4. dchar contains UTF-32 code points.  Again, it may not contain any 
other sort of encoding.

5. For other encodings, such as ISO-8859-1 or KOI-8R, you should use 
ubyte/byte or some other method.  It is not valid to use char.

6. The byte FF (a single 8-bit octet) may never appear in any valid 
UTF-8 string.  Since char can only contain UTF-8, a char that holds 
this octet represents invalid data.

7. Code points are the characters in Unicode; they are "compressed", so 
to speak, in encodings such as UTF-8 and UTF-16.  UCS-2 (for the Basic 
Multilingual Plane only) and UCS-4 (UTF-32) store full code points.

8. If you were to examine the individual bytes of a wchar string, the 
8-bit octet "FF" may well show up.  Nonetheless, since char cannot be 
used for UTF-16, this doesn't matter.

9. For the above reason, wchar (UTF-16) uses FFFF as its initializer.  
This value plays the same role for UTF-16 that FF plays for UTF-8.
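
As a rough sanity check of points 6 to 8, here is a small sketch.  It is 
in Python rather than D, purely because its built-in codecs make the 
byte-level facts easy to inspect; the function names (utf8_bytes, 
is_valid_utf8, unused_utf8_octets) are my own and not part of any 
library.  It says nothing about how D implements char, only about what 
UTF-8 and UTF-16 allow:

```python
def utf8_bytes(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (hypothetical helper)."""
    return chr(code_point).encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def unused_utf8_octets() -> list:
    """List every octet value that no encoded code point ever produces."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        used.update(utf8_bytes(cp))
    return sorted(set(range(256)) - used)


if __name__ == "__main__":
    # U+00FF becomes two octets in UTF-8, neither of which is FF.
    print(utf8_bytes(0x00FF).hex())             # c3bf
    # A lone FF octet is rejected, while 0 is perfectly valid.
    print(is_valid_utf8(b"\xff"))               # False
    print(is_valid_utf8(b"\x00"))               # True
    # The same code point in UTF-16 does contain an FF octet (point 8).
    print("\u00ff".encode("utf-16-le").hex())   # ff00
    # Matches RFC 3629: C0, C1 and F5 to FF never appear.
    print([hex(b) for b in unused_utf8_octets()])
```

The exhaustive loop takes a second or so; it is only there to show that 
FF is excluded by the encoding itself, not by convention.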

Given the above, I think I might answer your questions:

1. UTF-8 character here could mean an 8-bit octet or a code point.  In 
this case (the value 0), they are both the same and represent a 
perfectly valid character in a string.

2. ASCII does not matter; char is not ASCII.  It happens that ASCII 
bytes 0 to 127 correspond to the same code points in Unicode, and the 
same characters in UTF-8.

3. It does not matter; KOI-8R encoded strings should not be placed in 
char arrays.  You should use UTF-8 or another encoding for your Russian 
text.

4. If you wish to use KOI-8R (or any other encoding not based on 
Unicode) you should not be using char arrays, which are meant for 
Unicode-related encodings only.
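
To make answers 3 and 4 concrete, here is a minimal sketch of the 
"keep it as bytes, then transcode" approach.  Again it is Python rather 
than D, and koi8r_to_utf8 is a hypothetical helper name of mine; in D 
the raw input would live in a ubyte[] and only the converted result 
would go into a char[]:

```python
def koi8r_to_utf8(raw: bytes) -> bytes:
    """Transcode KOI8-R encoded bytes to UTF-8 (hypothetical helper)."""
    # Decode the legacy bytes to code points, then re-encode as UTF-8.
    return raw.decode("koi8_r").encode("utf-8")


if __name__ == "__main__":
    raw = b"\xff"                     # a single KOI8-R byte with value FF
    converted = koi8r_to_utf8(raw)
    print(converted.hex())            # d0aa
    print(0xFF in converted)          # False
```

The FF octet is a legitimate letter in KOI8-R, but once the text is 
converted to UTF-8 the octet no longer appears, so it cannot collide 
with the FF initializer.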

Obviously this is very different from C, but that's the good thing 
about D in many ways ;).

Thanks,
-[Unknown]



> "Walter Bright" <newshound at digitalmars.com> wrote in message 
> news:eagk1o$1mph$1 at digitaldaemon.com...
>> Andrew Fedoniouk wrote:
>>> Following assumption ( 
>>> http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):
>>>
>>> "codepoint U+FFFF is not a legitimate Unicode character, and, 
>>> furthermore, it is guaranteed by the
>>> Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode 
>>> character.
>>> This codepoint will remain forever unassigned, precisely so that it may 
>>> be used
>>> for purposes such as this."
>>>
>>> is just wrong.
>>>
>>> 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
>>> R-zone: {U+FFF0..U+FFFF} - region assigned already.
>> "the value FFFF is guaranteed not to be a Unicode character at all"
>> http://www.unicode.org/charts/PDF/UFFF0.pdf
>>
>>
>>> 2) For char[] selection of 0xFF is wrong and even worse.
>>> For example character with code 0xFF in Latin-I encoding is
>>> "y diaeresis". In many European languages and Far East encodings 0xFF is 
>>> a valid code point.
>>> For example in KOI-8 encoding 0xFF is officially assigned value.
>> char[] is not Unicode, it is UTF-8. For UTF-8, 0xFF is not a valid value. 
>> The Unicode U00FF is not encoded into UTF-8 as FF.
>>
>> "The octet values C0, C1, F5 to FF never appear." 
>> http://www.ietf.org/rfc/rfc3629.txt
>>
>>
>>> What is the point of current initialization?
>> The point is to initialize it with an invalid value, in order to flush out 
>> uninitialized data errors.
>>
>>> If you are doing initialization already
>>> and this initialization is a part of specification so why not to use
>>> official "Nul" values in this case?
>> Because 0 is a valid UTF-8 character.
> 
> 1) What "UTF-8 character" means exactly?
> 2) In ASCII char(0) is officially NUL. Why not to initialize strings
> by null?
> 
>>
>>> You are doing the same for floats - you are using NaNs there
>>>  (Null value for floats). Why not to use the same for chars?
>> The FF initialization does correspond (as close as we can get) with NaN 
>> for floats. 0 can masquerade as legitimate data, FF cannot.
> 
> I don't get it, sorry. In KOI-8R (Russian) encoding 0xFF is letter '?'
> Are you saying that I cannot use char[] to represent Russian text in D?
> 
> Andrew Fedoniouk.
> http://terrainformatica.com
> 
> 


