To Walter, about char[] initialization by FF

Andrew Fedoniouk news at terrainformatica.com
Sat Jul 29 16:40:41 PDT 2006


"Unknown W. Brackets" <unknown at simplemachines.org> wrote in message 
news:eagn4d$1q1t$1 at digitaldaemon.com...
> Andrew,
>
> I think it will make a lot more sense if you keep these things in mind... 
> (I'm sure you already know all of them, I'm just listing them out since 
> they're crucial and must be thought of together):
>
> 1. char, wchar, and dchar are separate types.

No objections to this.

>
> 2. char contains UTF-8 bytes.  It may not contain UTF-16, UCS-2, KOI-8R, 
> or any other encoding.  It must contain UTF-8.

Sorry, but the plural form "char contains UTF-8 bytes" is wrong.

What do you think char means:
1) char is an octet (byte), a member of a UTF-8 sequence, -or-
2) char is the code point of some character in some character table.

?

Probably I am treating English too literally, but
a char(acter) is not a UTF-8 byte.  And never was.

A char is the index of some glyph in some encoding table.
This is the common definition used everywhere.
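
To make the two readings concrete, here is a minimal sketch in Python (used only for illustration; it is not D and not part of the original exchange). It takes one Cyrillic letter and shows its code point next to its octets in UTF-8 and in KOI8-R. The variable names are hypothetical.

```python
# Illustration only: one letter seen as a code point and as encoded octets.
letter = "\u044f"  # CYRILLIC SMALL LETTER YA

code_point = ord(letter)               # index of the character in the Unicode table
utf8_octets = letter.encode("utf-8")   # the same character as UTF-8 octets
koi8_octets = letter.encode("koi8_r")  # the same character in single-byte KOI8-R

print(hex(code_point))    # 0x44f
print(utf8_octets.hex())  # d18f
print(koi8_octets.hex())  # d1
```

Under reading 1) a char holds one of the two octets d1 or 8f; under reading 2) it holds the single value identifying the letter in some table.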

>
> 3. wchar contains UTF-16.  It is similar to char in every other way (may 
> not contain any other encoding than UTF-16, not even UCS-2.)

The same problem as in #2.

What is wchar (uint16) for you:
1) wchar is the index of a Unicode scalar value in the Basic Multilingual
Plane (BMP)
-or-
2) wchar is a uint16 value, a member of a UTF-16 sequence.

?
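
The difference only shows up outside the BMP. A small Python sketch (illustration only; the helper name utf16_units is hypothetical) lists the 16-bit units that UTF-16 produces for a BMP character and for a character beyond it:

```python
def utf16_units(text):
    """Return the 16-bit UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

bmp_char = "\u044f"         # CYRILLIC SMALL LETTER YA, inside the BMP
astral_char = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the BMP

print([hex(u) for u in utf16_units(bmp_char)])     # ['0x44f']
print([hex(u) for u in utf16_units(astral_char)])  # ['0xd834', '0xdd1e']
```

For the BMP character both readings give the same number; for the second one a single uint16 holds only half of a surrogate pair, so only reading 2) fits.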

>
> 4. dchar contains UTF-32 code points.  It may not contain any other sort 
> of encoding, again.

Oh.....

UTF-32 (like any other UTF) is a transformation format:
the group name for two different encodings, UTF-32BE and UTF-32LE.

"UTF-32 code point" is nonsense.

UTF-32 defines how to encode a Unicode code point as,
again, a sequence of four bytes (octets).

I would define this thing as:

dchar (a better name is uchar) is a type for representing
the full set of Unicode code points (21-bit values).

Please note: a "transformation format" (UTF) is not by
any means a "manipulation format".

The representation of text in memory suitable for
manipulation (e.g. text processing) is, as a rule, different.

You cannot use UTF-8-encoded Russian text for
analysis. No way.
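
A short Python sketch of these points (illustration only, with hypothetical variable names): the same code point serialized as UTF-32BE and UTF-32LE, the number of bits needed for the largest code point, and the mismatch between octet count and character count for a Russian word in UTF-8.

```python
letter = "\u044f"  # CYRILLIC SMALL LETTER YA

# One code point, two different four-octet serializations.
utf32_be = letter.encode("utf-32-be")
utf32_le = letter.encode("utf-32-le")
print(utf32_be.hex())  # 0000044f
print(utf32_le.hex())  # 4f040000

# The largest Unicode code point, U+10FFFF, fits in 21 bits.
max_bits = (0x10FFFF).bit_length()
print(max_bits)  # 21

# "privet" in Cyrillic: 6 characters, but 12 octets in UTF-8,
# so an octet index does not identify a character.
word = "\u043f\u0440\u0438\u0432\u0435\u0442"
print(len(word), len(word.encode("utf-8")))  # 6 12
```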

>
> 5. For other encodings, such as ISO-8859-1 or KOI-8R, you should use 
> ubyte/byte or some other method.  It is not valid to use char.

Vice versa. For UTF-8-encoded strings you should use byte[],
and for strings in single-byte encodings you should use char.

>
> 6. The FF byte (8-bit octet sequence) may never appear in any valid UTF-8 
> string.  Since char can only contain UTF-8 strings, it represents invalid 
> data if it contains such an 8-bit octet.

No objections to that: in a UTF-8 octet sequence, 0xFF is an invalid
octet value. But please note: in a sequence of octets.
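
This can be checked with a small Python sketch (illustration only; the helper names are hypothetical). It confirms that the octet 0xFF is rejected by a UTF-8 decoder and never appears in the UTF-8 encoding of any code point, while the same octet is an ordinary letter in the single-byte KOI8-R encoding.

```python
def is_valid_utf8(data):
    """Return True if data decodes as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def ff_never_in_utf8():
    """Check every encodable code point (surrogates excluded) for an 0xFF octet."""
    return all(
        0xFF not in chr(cp).encode("utf-8")
        for cp in range(0x110000)
        if not 0xD800 <= cp <= 0xDFFF
    )

print(is_valid_utf8(b"\xff"))    # False
print(ff_never_in_utf8())        # True
print(b"\xff".decode("koi8_r"))  # CYRILLIC CAPITAL LETTER HARD SIGN, U+042A
```

So whether 0xFF is "invalid" depends entirely on reading the value as a member of a UTF-8 octet sequence.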

>
> 7. Code points are the characters in Unicode; they are "compressed", so to 
> speak, in encodings such as UTF-8 and UTF-16.  UCS-2 and UCS-4 (UTF-32) 
> contain full code points.

Sorry, but UCS-4 *is not* UTF-32
http://www.unicode.org/reports/tr19/tr19-9.html

I will ask again:

What does
char c = 'a';
mean to you?

And the following in C/C++:

#pragma(encoding,"KOI-8R")

char c = '?';

?


>
> 8. If you were to examine the bytes in a wchar string, it may be possible 
> that the 8-bit octet sequence "FF" might show up.  Nonetheless, since char 
> cannot be used for UTF-16, this doesn't matter.

It is not clear what you mean here. Could you clarify, especially the last
statement?

>
> 9. For the above reason, wchar (UTF-16) uses FFFF.  This character is 
> similar to FF for UTF-8.
>
> Given the above, I think I might answer your questions:
>
> 1. UTF-8 character here could mean an 8-bit octet of code point.  In this 
> case, they are both the same and represent a perfectly valid character in 
> a string.

Sorry, I am not buying the following:
"UTF-8 character" and "8-bit octet of code point".

>
> 2. ASCII does not matter; char is not ASCII.  It happens that ASCII bytes 
> 0 to 127 correspond to the same code points in Unicode, and the same 
> characters in UTF-8.

"ASCII does not matter"... for whom?

>
> 3. It does not matter; KOI-8R encoded strings should not be placed in char 
> arrays.  You should use UTF-8 or another encoding for your Russian text.

"You should use UTF-8 or another encoding for your Russian text."

Thanks.

Advice from my side:
Let me know when you visit Russia.
I will ask representatives of the Russian developer community and web authors
to meet you.

The advice per se: you should wear a helmet.

>
> 4. If you wish to use KOI-8R (or any other encoding not based on Unicode) 
> you should not be using char arrays, which are meant for Unicode-related 
> encodings only.

The same advice as above.

>
> Obviously this is by far different from C, but that's the good thing about 
> D in many ways ;).

In Israel they have an old saying:
"Not a human for Saturday but Saturday for human".

I do have practical experience in writing text processing software in
encodings other than "US-ASCII" and have heard your advice about
UTF-8 usage with interest.

Please don't take any of this personally; there is no intention to harm anybody.
Honestly and with a smile :)

Andrew.




