To Walter, about char[] initialization by FF
Unknown W. Brackets
unknown at simplemachines.org
Sat Jul 29 19:07:54 PDT 2006
2. Sorry: an array of char contains UTF-8 bytes, which are 8-bit octets
(a single char is one 8-bit octet).
A single character, in UTF-8 encoding, may take anywhere from 1 to 4 bytes.
Thus, one char CANNOT hold every Unicode code point. You may
need an array of multiple chars (bytes) to hold a single code point.
This is not what it means to me; this is what it means. A char is a
single 8-bit octet in a UTF-8 sequence. They ARE NOT by any means code
points.
I'm sorry that I did not specify "array", but I fear you are being
pedantic here; I'm sure you knew what I meant.
A char is a single byte in a UTF-8 sequence. I'm afraid I think calling
it an index to a glyph is dangerous, because it could be misunderstood.
Again, a single char CANNOT represent code points 128 and above,
because it is only ONE byte.
A single char therefore may not represent a glyph all of the time, but
rather will represent a byte in the UTF-8 sequence, which may be used
(along with the other necessary bytes) to decode the entirety of the
code point.
I hope I'm not being overly pedantic here, but I think your definition
is either lax or wrong. But, that is only by its reading in English.
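To make the distinction concrete, here is a minimal sketch in D. It assumes a current D2 compiler and Phobos's std.utf (the library as it stood in 2006 may differ); the helper names codeUnits and firstCodePoint are made up for illustration. It shows that the length of a char array counts bytes, and that several chars must be decoded together to recover one code point.

```d
import std.utf : decode;

// Number of char code units (bytes) a UTF-8 string occupies.
// Hypothetical helper, named for this example only.
size_t codeUnits(string s)
{
    return s.length;
}

// Decode the first code point of a non-empty UTF-8 string.
// Hypothetical helper, named for this example only.
dchar firstCodePoint(string s)
{
    size_t index = 0;
    return decode(s, index);
}

void main()
{
    // An ASCII character fits in a single char.
    assert(codeUnits("a") == 1);

    // This character needs three chars (bytes) in UTF-8.
    assert(codeUnits("蝿") == 3);

    // None of those three chars is a code point by itself:
    // every byte of a multi-byte sequence is 0x80 or above.
    foreach (char c; "蝿")
        assert(cast(ubyte) c >= 0x80);

    // Only decoding the whole sequence yields the code point.
    assert(firstCodePoint("蝿") == '蝿');
}
```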
3. It is #2, as above. wchars are not UCS-2. A single wchar cannot
always represent a full code point alone. Arrays of wchars must be used
for some code points. As I read your question, #1 is UCS-2 (a
fixed-length 16-bit encoding) and #2 is UTF-16 (a variable-length
encoding built on 16-bit units.)
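A small sketch of the UTF-16 case, again assuming a current D2 compiler and Phobos's std.utf. The helper utf16Units is hypothetical, and U+1D11E is just one example of a code point outside the Basic Multilingual Plane.

```d
import std.utf : toUTF16;

// Number of wchar code units needed for a string in UTF-16.
// Hypothetical helper, named for this example only.
size_t utf16Units(string s)
{
    return toUTF16(s).length;
}

void main()
{
    // Code points inside the BMP take one wchar each.
    assert(utf16Units("a") == 1);
    assert(utf16Units("蝿") == 1);

    // A code point above U+FFFF takes two wchars (a surrogate pair),
    // which is what UCS-2 cannot express.
    assert(utf16Units("\U0001D11E") == 2);

    wstring w = "\U0001D11E"w;
    assert(w[0] >= 0xD800 && w[0] <= 0xDBFF); // high surrogate
    assert(w[1] >= 0xDC00 && w[1] <= 0xDFFF); // low surrogate
}
```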
4. I was ignoring endianness issues for simplicity. My point here is
that a UTF-32 character directly represents a code point. Sorry again
for the non-pedantic laxness in my wording.
5. Wrong. There is no vice versa. You may use byte or ubyte arrays for
your UTF-8 encoded strings and so forth.
In case you didn't realize I was trying to say this:
*char is not for single byte encodings. char is ONLY for UTF-8. char
may not be used for any other encoding unless you wish to have problems.
char is not the same as in other languages, e.g. C.*
If you wish for an 8-bit octet value (such as a character in any
encoding, single-byte or otherwise) you should not be using a char.
That is not a correct usage for them; that is what byte and ubyte are for.
It is expected that chars in an array will follow a specific sequence;
that is, that they will be encoded in UTF-8. It is not possible to
guarantee this if you use other encodings, which is why writefln() will
fail in such cases.
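As a rough illustration, assuming a current D2 compiler and Phobos's std.utf.validate (which throws UTFException on malformed input): the three byte values below are what "при" would look like in KOI8-R, and isValidUtf8 is a hypothetical helper. Kept in a ubyte array they are just bytes; viewed as chars they are not valid UTF-8.

```d
import std.utf : validate, UTFException;

// True if the chars form a well-formed UTF-8 sequence.
// Hypothetical helper, named for this example only.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // KOI8-R text belongs in ubyte[], where no encoding is implied.
    immutable(ubyte)[] koi8 = [0xD0, 0xD2, 0xC9];

    // Real UTF-8 text validates.
    assert(isValidUtf8("abc"));

    // The same KOI8-R bytes forced into a char array do not:
    // 0xD0 starts a two-byte sequence, but 0xD2 is not a continuation byte.
    assert(!isValidUtf8(cast(const(char)[]) koi8));
}
```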
6. Correct. And a single char (an 8-bit octet within a UTF-8 encoded
sequence) may never be FF, because no single 8-bit octet anywhere in a
valid UTF-8 sequence may be FF. Remember, char is not a code point. It
is a single 8-bit octet in a sequence.
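A short sketch of how this ties back to default initialization, assuming a current D2 compiler and Phobos's std.utf (isValidUtf8 is a hypothetical helper): the default values of the character types are deliberately not valid text, so a char array that was never filled in fails validation.

```d
import std.utf : validate, UTFException;

// True if the chars form a well-formed UTF-8 sequence.
// Hypothetical helper, named for this example only.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // Default initializers of the three character types.
    char c;
    wchar w;
    dchar d;
    assert(c == 0xFF);
    assert(w == 0xFFFF);
    assert(d == 0x0000FFFF);

    // A char array left at its default is all FF bytes: not valid UTF-8.
    char[3] fresh;
    assert(!isValidUtf8(fresh[]));

    // Once real UTF-8 is stored in it, it validates.
    fresh[] = "abc";
    assert(isValidUtf8(fresh[]));
}
```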
7. My mistake. I always consider them roughly the same (and for some
reason I thought that they had been made the same; but I assume your
link is current.)
Your first code sample defines a single UTF-8 character, 'a'. It is
lucky you did not try:
char c = '蝿';
(hopefully this character gets sent through to you properly; I will be
sending this message UTF-8 if my client allows it.)
Because that would have failed. A char cannot hold such a character,
which has a code point outside the range 0 - 127. You would need
to use an array of chars, or else a wchar or dchar.
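A sketch of the alternatives, assuming a current D2 compiler and Phobos's std.utf.codeLength; fitsInOneChar is a hypothetical helper, and the exact compiler diagnostic for the rejected line may vary between versions.

```d
import std.utf : codeLength;

// A char can hold a code point only if its UTF-8 form is one byte long.
// Hypothetical helper, named for this example only.
bool fitsInOneChar(dchar c)
{
    return codeLength!char(c) == 1;
}

void main()
{
    // char c = '蝿';  // rejected at compile time: does not fit in a char
    static assert(!__traits(compiles, { char c = '蝿'; }));

    assert(fitsInOneChar('a'));
    assert(!fitsInOneChar('蝿'));

    // Any of these can carry the character instead.
    string s = "蝿"; // array of chars, UTF-8
    wchar w = '蝿';  // one UTF-16 code unit (this code point is in the BMP)
    dchar d = '蝿';  // the code point itself

    assert(w == d);
    assert(s.length == codeLength!char(d));
}
```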
Your second example means nothing to me. I don't really care for such
pragmas or putting untranslated text directly in source code, and have
never dealt with it.
8. You may not use a single char or an array of chars to represent
UTF-16. It may only represent UTF-8. If you wish to use UTF-16, you
must use wchars.
1 (the second #1): but for the code point 0, as encoded in UTF-8, they
are the same - do you not agree? A 0 is a zero is a zero. It doesn't
matter what he means.
2 (the second): rules about ASCII do not apply to char. Just as rules
in Portugal do not dissuade me here in Los Angeles.
3 (the second): I have led the development of multilingual software
which was used by quite a large number of people. I also helped
coordinate, and later interface with the assigned coordinator of,
translation. This software was translated into Thai, Chinese (Simplified
and Traditional), Russian, Italian, Spanish, Japanese, Catalan, and
several other languages. More than twenty anyway.
At first I suggested that everyone use their own encoding, and handled
that (sometimes painfully) in the code. I would sometimes get
comments about using Unicode instead (from the translators who would
have preferred this.) This software now uses UTF-8 and remains
translated in these languages.
So, while I have not been to Russia (although I have worked with
numerous Russian developers, consumers, and translators) I would tend to
disagree with your assertion. Also I do not like helmets.
Obviously, I mean nothing to be taken personally as well; we are only
talking about UTF-8, Unicode, its usage in D, and being pedantic ;).
And helmets, we touched that subject too. But not about each other, really.
Thanks,
-[Unknown]
> "Unknown W. Brackets" <unknown at simplemachines.org> wrote in message
> news:eagn4d$1q1t$1 at digitaldaemon.com...
>> Andrew,
>>
>> I think it will make a lot more sense if you keep these things in mind...
>> (I'm sure you already know all of them, I'm just listing them out since
>> they're crucial and must be thought of together):
>>
>> 1. char, wchar, and dchar are separate types.
>
> No objections with this.
>
>> 2. char contains UTF-8 bytes. It may not contain UTF-16, UCS-2, KOI-8R,
>> or any other encoding. It must contain UTF-8.
>
> Sorry but plural form "char contains UTF-8 bytes" is wrong.
>
> What you think char means:
> 1) char is an octet (byte) - member of utf-8 sequence -or-
> 2) char is code point of some character in some character table.
>
> ?
>
> Probably I am treating English too literally but
> char(acter) is not an UTF-8 byte. And never was.
>
> char is an index of some glyph in some encoding table.
> This is common definition used everywhere.
>
>> 3. wchar contains UTF-16. It is similar to char in every other way (may
>> not contain any other encoding than UTF-16, not even UCS-2.)
>
> The same problem as in #2.
>
> What is wchar (uint16) for you:
> 1) wchar as is an index of a Unicode scalar value in Basic Multilingual
> Plane (BMP)
> -or-
> 2) is a uint16 value - member of UTF-16 sequence.
>
> ?
>
>> 4. dchar contains UTF-32 code points. It may not contain any other sort
>> of encoding, again.
>
> Oh.....
>
> UTF-32 (as any other utfs) is a transformation format -
> group name of two different encodings UTF-32BE and UTF-32LE.
>
> UTF-32 code point is a non-sense.
>
> UTF-32 defines of how to encode Unicode code point in
> again sequence of four bytes - octets.
>
> I would define this thing as
>
> dchar ( better name is uchar ) is type for representing
> full set of Unicode Code Points (21bit value).
>
> Please note: "transformation format" (UTF) is not by
> any means a "manipulation format".
>
> Representation of text in memory suitable for
> manipulation (e.g. text processing) is different as rule.
>
> You cannot use utf-8 encoded russian text for
> analysis. No way.
>
>> 5. For other encodings, such as ISO-8859-1 or KOI-8R, you should use
>> ubyte/byte or some other method. It is not valid to use char.
>
> Vice versa. For utf-8 encoded strings you should use byte[]
> and for strings using single byte encodings you should use char.
>
>> 6. The FF byte (8-bit octet sequence) may never appear in any valid UTF-8
>> string. Since char can only contain UTF-8 strings, it represents invalid
>> data if it contains such an 8-bit octet.
>
> No objections with that, for UTF-8 octet sequences 0xFF is invalid
> value of octet in the sequence. But please note: in the sequence of octets.
>
>> 7. Code points are the characters in Unicode; they are "compressed", so to
>> speak, in encodings such as UTF-8 and UTF-16. UCS-2 and UCS-4 (UTF-32)
>> contain full code points.
>
> Sorry, but UCS-4 *is not* UTF-32
> http://www.unicode.org/reports/tr19/tr19-9.html
>
> I will ask again:
>
> What:
> char c = 'a';
> means for you?
>
> And following in C/C++:
>
> #pragma(encoding,"KOI-8R")
>
> char c = '?';
>
> ?
>
>
>> 8. If you were to examine the bytes in a wchar string, it may be possible
>> that the 8-bit octet sequence "FF" might show up. Nonetheless, since char
>> cannot be used for UTF-16, this doesn't matter.
>
> Not clear what you mean here. Could you clarify? Especially last statement.
>
>> 9. For the above reason, wchar (UTF-16) uses FFFF. This character is
>> similar to FF for UTF-8.
>>
>> Given the above, I think I might answer your questions:
>>
>> 1. UTF-8 character here could mean an 8-bit octet of code point. In this
>> case, they are both the same and represent a perfectly valid character in
>> a string.
>
> Sorry I am not buying following:
> "UTF-8 character" and "8-bit octet of code point"
>
>> 2. ASCII does not matter; char is not ASCII. It happens that ASCII bytes
>> 0 to 127 correspond to the same code points in Unicode, and the same
>> characters in UTF-8.
>
> "ASCII does not matter"... for whom?
>
>> 3. It does not matter; KOI-8R encoded strings should not be placed in char
>> arrays. You should use UTF-8 or another encoding for your Russian text.
>
> "You should use UTF-8 or another encoding for your Russian text."
>
> Thanks.
>
> Advice from my side:
> Let me know when you will visit Russia.
> I will ask representatives of russian developer community and web authors
> to meet you.
>
> Advice per se: You should wear a helmet.
>
>> 4. If you wish to use KOI-8R (or any other encoding not based on Unicode)
>> you should not be using char arrays, which are meant for Unicode-related
>> encodings only.
>
> The same advice as above.
>
>> Obviously this is by far different from C, but that's the good thing about
>> D in many ways ;).
>
> In Israel they have an old saying:
> "Not a human for Saturday but Saturday for human".
>
> I do have practical experience in writing text processing software in
> encodings other than "US-ASCII" and have heard your advices about
> UTF-8 usage with interest.
>
> Please don't take all of this personal - no intention to harm anybody.
> Honestly and with smile :)
>
> Andrew.
>
>