To Walter, about char[] initialization by FF
Andrew Fedoniouk
news at terrainformatica.com
Sat Jul 29 16:40:41 PDT 2006
"Unknown W. Brackets" <unknown at simplemachines.org> wrote in message
news:eagn4d$1q1t$1 at digitaldaemon.com...
> Andrew,
>
> I think it will make a lot more sense if you keep these things in mind...
> (I'm sure you already know all of them, I'm just listing them out since
> they're crucial and must be thought of together):
>
> 1. char, wchar, and dchar are separate types.
No objection to this.
>
> 2. char contains UTF-8 bytes. It may not contain UTF-16, UCS-2, KOI-8R,
> or any other encoding. It must contain UTF-8.
Sorry, but the plural form "char contains UTF-8 bytes" is wrong.
What do you think char means:
1) char is an octet (byte), a member of a UTF-8 sequence -or-
2) char is the code point of some character in some character table
?
Probably I am treating English too literally, but
a char(acter) is not a UTF-8 byte. And never was.
A char is an index of some glyph in some encoding table.
This is the common definition used everywhere.
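A small sketch of the two meanings, in Python only for brevity (the
variable names are mine, and the letter is just an example):

```python
# One Cyrillic letter: a single character, hence a single code point.
ch = "\u044f"  # CYRILLIC SMALL LETTER YA

code_point = ord(ch)               # meaning 2: index in the character table
utf8_octets = ch.encode("utf-8")   # meaning 1: the octets of the UTF-8 sequence

print(hex(code_point))                  # 0x44f
print([hex(b) for b in utf8_octets])    # ['0xd1', '0x8f']
print(len(ch), len(utf8_octets))        # 1 2
```

One character, one code point, two octets: neither octet taken alone
is a character.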
>
> 3. wchar contains UTF-16. It is similar to char in every other way (may
> not contain any other encoding than UTF-16, not even UCS-2.)
The same problem as in #2.
What is wchar (uint16) for you:
1) wchar is an index of a Unicode scalar value in the Basic Multilingual
Plane (BMP)
-or-
2) wchar is a uint16 value, a member of a UTF-16 sequence
?
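The two readings differ as soon as a character lies outside the BMP.
A sketch in Python (utf16_units is a helper I made up for this
illustration, and the two characters are arbitrary examples):

```python
import struct

def utf16_units(text):
    """Return the 16-bit code units of text encoded as UTF-16."""
    raw = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))

bmp = "\u044f"         # inside the BMP: one uint16 is the scalar value
astral = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the BMP

print([hex(u) for u in utf16_units(bmp)])     # ['0x44f']
print([hex(u) for u in utf16_units(astral)])  # ['0xd834', '0xdd1e']
```

For the second character neither uint16 is a scalar value; only the
surrogate pair taken together denotes the code point.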
>
> 4. dchar contains UTF-32 code points. It may not contain any other sort
> of encoding, again.
Oh.....
UTF-32 (like any other UTF) is a transformation format,
the group name of two different encodings, UTF-32BE and UTF-32LE.
"UTF-32 code point" is nonsense.
UTF-32 defines how to encode a Unicode code point as,
again, a sequence of four bytes (octets).
I would define this thing as:
dchar (a better name would be uchar) is a type for representing
the full set of Unicode code points (a 21-bit value).
Please note: a "transformation format" (UTF) is not by
any means a "manipulation format".
The representation of text in memory suitable for
manipulation (e.g. text processing) is different as a rule.
You cannot use UTF-8 encoded Russian text for
analysis. No way.
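To illustrate both points, a sketch in Python (the sample word and the
helper is_valid_utf8 are my own choices, not anything from D):

```python
word = "\u043c\u0438\u0440"  # a three-letter Russian word

# The same code point as two different UTF-32 octet sequences.
be = word[0].encode("utf-32-be")
le = word[0].encode("utf-32-le")
print(be.hex(), le.hex())  # 0000043c 3c040000

# The largest code point, U+10FFFF, needs 21 bits.
max_bits = (0x10FFFF).bit_length()

def is_valid_utf8(data):
    """True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Three letters become six octets, so octet offsets are not letter
# offsets: cutting after the third octet lands inside the second letter.
octets = word.encode("utf-8")
print(len(word), len(octets))     # 3 6
print(is_valid_utf8(octets[:3]))  # False
```

Anything that indexes, slices or counts letters has to decode the
octets into code points first.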
>
> 5. For other encodings, such as ISO-8859-1 or KOI-8R, you should use
> ubyte/byte or some other method. It is not valid to use char.
Vice versa. For UTF-8 encoded strings you should use byte[],
and for strings in single-byte encodings you should use char.
>
> 6. The FF byte (8-bit octet sequence) may never appear in any valid UTF-8
> string. Since char can only contain UTF-8 strings, it represents invalid
> data if it contains such an 8-bit octet.
No objection to that: in a UTF-8 octet sequence, 0xFF is an invalid
octet value. But please note: in a sequence of octets.
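A sketch of why the qualification matters, in Python (decodes_as is a
helper I made up; the codec names are the ones Python ships with):

```python
octet = b"\xff"

def decodes_as(data, encoding):
    """Decode data with the given codec, or return None if it is invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# Invalid as a member of a UTF-8 octet sequence ...
print(decodes_as(octet, "utf-8"))    # None

# ... but an ordinary character in single-byte character tables.
print(decodes_as(octet, "latin-1") == "\u00ff")  # True
print(decodes_as(octet, "cp1251") == "\u044f")   # True
print(decodes_as(octet, "koi8_r") is not None)   # True
```

So 0xFF is "invalid" only under the assumption that the byte belongs
to a UTF-8 sequence.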
>
> 7. Code points are the characters in Unicode; they are "compressed", so to
> speak, in encodings such as UTF-8 and UTF-16. USC-2 and USC-4 (UTF-32)
> contain full code points.
Sorry, but UCS-4 *is not* UTF-32
http://www.unicode.org/reports/tr19/tr19-9.html
I will ask again:
What does
char c = 'a';
mean for you?
And the following in C/C++:
#pragma(encoding,"KOI-8R")
char c = '?';
?
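To show what is at stake, a sketch in Python. It assumes the character
in the second snippet is some Cyrillic letter; the letter used below is
only my example, and fits_in_one_byte is a made-up helper:

```python
def fits_in_one_byte(ch, encoding):
    """True if ch encodes to exactly one byte in the given encoding."""
    try:
        return len(ch.encode(encoding)) == 1
    except UnicodeEncodeError:
        return False

latin = "a"
cyrillic = "\u044f"  # example Cyrillic letter

# 'a' is the same single byte 0x61 whatever the encoding is.
print({enc: latin.encode(enc) for enc in ("ascii", "utf-8", "koi8_r")})

# The Cyrillic letter is one byte in KOI8-R, so it fits a one-byte char;
# in UTF-8 it does not, and in ASCII it does not exist at all.
print(fits_in_one_byte(cyrillic, "koi8_r"))  # True
print(fits_in_one_byte(cyrillic, "utf-8"))   # False
print(fits_in_one_byte(cyrillic, "ascii"))   # False
```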
>
> 8. If you were to examine the bytes in a wchar string, it may be possible
> that the 8-bit octet sequence "FF" might show up. Nonetheless, since char
> cannot be used for UTF-16, this doesn't matter.
Not clear what you mean here. Could you clarify? Especially the last statement.
>
> 9. For the above reason, wchar (UTF-16) uses FFFF. This character is
> similar to FF for UTF-8.
>
> Given the above, I think I might answer your questions:
>
> 1. UTF-8 character here could mean an 8-bit octet of code point. In this
> case, they are both the same and represent a perfectly valid character in
> a string.
Sorry, I am not buying the following:
"UTF-8 character" and "8-bit octet of code point"
>
> 2. ASCII does not matter; char is not ASCII. It happens that ASCII bytes
> 0 to 127 correspond to the same code points in Unicode, and the same
> characters in UTF-8.
"ASCII does not matter"... for whom?
>
> 3. It does not matter; KOI-8R encoded strings should not be placed in char
> arrays. You should use UTF-8 or another encoding for your Russian text.
"You should use UTF-8 or another encoding for your Russian text."
Thanks.
Advice from my side:
Let me know when you visit Russia.
I will ask representatives of the Russian developer community and web
authors to meet you.
Advice per se: You should wear a helmet.
>
> 4. If you wish to use KOI-8R (or any other encoding not based on Unicode)
> you should not be using char arrays, which are meant for Unicode-related
> encodings only.
The same advice as above.
>
> Obviously this is by far different from C, but that's the good thing about
> D in many ways ;).
In Israel they have an old saying:
"Not man for Saturday, but Saturday for man".
I do have practical experience in writing text processing software in
encodings other than "US-ASCII", and I have heard your advice about
UTF-8 usage with interest.
Please don't take all of this personally - no intention to harm anybody.
Honestly and with a smile :)
Andrew.