To Walter, about char[] initialization by FF

Unknown W. Brackets unknown at simplemachines.org
Sat Jul 29 19:07:54 PDT 2006


2. Sorry, an array of char (a single char is one single 8 bit octet) 
contains UTF-8 bytes which are 8-bit octets.

A single character, in UTF-8 encoding, may be 1 byte, 2 bytes, etc. 
Thus, one char MAY NOT hold every single Unicode code point.  You may 
need an array of multiple chars (bytes) to hold a single code point.
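
To make the byte counts concrete, here is a small sketch.  It uses Python 
purely as a neutral tool for inspecting encodings and says nothing about 
D's own library; the helper name utf8_len and the sample code points are 
made up for illustration.

```python
# Count how many UTF-8 bytes (what D calls chars) one code point needs.
def utf8_len(code_point: int) -> int:
    return len(chr(code_point).encode("utf-8"))

# Hypothetical samples: one code point from each UTF-8 length class.
samples = [0x41, 0xE9, 0x877F, 0x1F600]
lengths = [utf8_len(cp) for cp in samples]

for cp, n in zip(samples, lengths):
    print(f"U+{cp:04X} needs {n} byte(s) in UTF-8")
```

Only the first sample fits in a single byte; the others need two, three, 
and four bytes respectively.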

This is not what it means to me; this is what it means.  A char is a 
single 8-bit octet in a UTF-8 sequence.  They ARE NOT by any means code 
points.

I'm sorry that I did not specify "array", but I fear you are being 
pedantic here; I'm sure you knew what I meant.

A char is a single byte in a UTF-8 sequence.  I think calling it an 
index to a glyph is dangerous, because it is easily misread.  Again, a 
single char CANNOT represent code points 128 and above, because it is 
only ONE byte.

A single char therefore does not always represent a glyph.  Rather, it 
represents one byte of a UTF-8 sequence, which is used (along with the 
other necessary bytes) to decode the entirety of the code point.

I hope I'm not being overly pedantic here, but I think your definition 
is either lax or wrong.  But, that is only by its reading in English.

3. It is #2, as above.  wchars are not UCS-2.  A single wchar cannot 
always represent a full code point; arrays of wchars must be used for 
some code points.  As I read your question, #1 is UCS-2 (a fixed-length 
16-bit encoding) and #2 is UTF-16 (a variable-length encoding built on 
16-bit units.)
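
The difference shows up as soon as a code point lies outside the Basic 
Multilingual Plane.  The sketch below (Python again, only as an 
inspection tool; utf16_units is a made-up helper name) lists the 16-bit 
units UTF-16 produces for one code point.

```python
# Return the 16-bit code units that UTF-16 uses for a single code point.
def utf16_units(code_point: int) -> list:
    data = chr(code_point).encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

bmp = utf16_units(0x877F)      # inside the BMP: one unit suffices
astral = utf16_units(0x1D11E)  # outside the BMP: needs a surrogate pair

print([f"{u:04X}" for u in bmp])
print([f"{u:04X}" for u in astral])
```

UCS-2 has no way to express the second case at all, which is why a wchar 
has to be read as a UTF-16 unit and not as a UCS-2 character.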

4. I was ignoring endianness issues for simplicity.  My point here is 
that a UTF-32 character directly represents a code point.  Sorry again 
for the non-pedantic laxness in my wording.
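
As a rough check of that claim, the following sketch (Python as a neutral 
tool; the sample string is arbitrary) fixes the byte order to little 
endian, reads the UTF-32 data back as 32-bit integers, and compares them 
with the code points.

```python
import struct

text = "a蝿\U0001D11E"           # one ASCII, one BMP, one non-BMP character
raw = text.encode("utf-32-le")   # byte order fixed, so no BOM is emitted

# Unpack the bytes as little-endian unsigned 32-bit integers.
units = list(struct.unpack(f"<{len(raw) // 4}I", raw))
code_points = [ord(ch) for ch in text]

print(units == code_points)
```

Each 32-bit unit is numerically the code point itself, with no further 
decoding step.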

5. Wrong.  There is no vice versa.  You may use byte or ubyte arrays for 
your UTF-8 encoded strings and so forth.

In case you didn't realize I was trying to say this:

*char is not for single byte encodings.  char is ONLY for UTF-8.  char 
may not be used for any other encoding unless you wish to have problems. 
  char is not the same as in other languages, e.g. C.*

If you wish for an 8-bit octet value (such as a character in any 
encoding; single byte or otherwise) you should not be using a char. 
That is not a correct usage for char; that is what byte and ubyte are 
for.

It is expected that chars in an array will follow a specific sequence; 
that is, that they will be encoded in UTF-8.  It is not possible to 
guarantee this if you use other encodings, which is why writefln() will 
fail in such cases.
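
To illustrate why such data trips up anything that expects UTF-8, here 
is a sketch in Python (not D; is_valid_utf8 is a made-up helper, and 
whether writefln() performs exactly this check is an assumption on my 
part).  The same Russian word is encoded once as UTF-8 and once as 
KOI8-R, then both byte strings are tested for UTF-8 validity.

```python
# True if the bytes form a well-formed UTF-8 sequence.
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

word = "Привет"
as_utf8 = word.encode("utf-8")     # two bytes per Cyrillic letter
as_koi8r = word.encode("koi8-r")   # one byte per Cyrillic letter

print(len(as_utf8), is_valid_utf8(as_utf8))
print(len(as_koi8r), is_valid_utf8(as_koi8r))
```

The KOI8-R bytes are perfectly good text in their own encoding, but read 
as UTF-8 they are not a valid sequence.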

6.  Correct.  And a single char (one 8-bit octet in a UTF-8 encoded 
sequence of octets) may never be FF, because no single 8-bit octet 
anywhere in a valid UTF-8 sequence may be FF.  Remember, char is not a 
code point.  It is a single 8-bit octet in a sequence.
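
This can be checked by brute force.  The sketch below (Python as an 
inspection tool; bytes_used_by_utf8 is a made-up name) encodes every 
Unicode scalar value as UTF-8 and collects the byte values that occur.

```python
# Collect every byte value that appears in the UTF-8 encoding of any
# Unicode scalar value (U+0000..U+10FFFF, excluding surrogates).
def bytes_used_by_utf8() -> set:
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded in UTF-8
        used.update(chr(cp).encode("utf-8"))
    return used

used = bytes_used_by_utf8()
never_used = sorted(set(range(256)) - used)

print([f"{b:02X}" for b in never_used])
```

FF is among the byte values that never occur, which is what makes it 
usable as an "invalid" marker for char.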

7. My mistake.  I always consider them roughly the same (and for some 
reason I thought that they had been made the same; but I assume your 
link is current.)

Your first code sample defines a single UTF-8 character, 'a'.  It is 
lucky you did not try:

char c = '蝿';

(hopefully this character gets sent through to you properly; I will be 
sending this message UTF-8 if my client allows it.)

Because that would have failed.  A char cannot hold such a character, 
which has a code point outside the range 0 - 127.  You would need 
either an array of chars or a wider type such as wchar or dchar.
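
For that particular character, a quick sketch (Python only as a neutral 
tool; decodes_alone is a made-up helper) shows how many UTF-8 bytes it 
takes and that none of those bytes means anything on its own.

```python
ch = "蝿"
encoded = ch.encode("utf-8")

# True if a lone byte is, by itself, a complete UTF-8 sequence.
def decodes_alone(b: int) -> bool:
    try:
        bytes([b]).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

alone = [decodes_alone(b) for b in encoded]

print(len(encoded), [f"{b:02X}" for b in encoded], alone)
```

The character only exists when all of its bytes are taken together, 
which is why it cannot be stored in one char.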

Your second example means nothing to me.  I don't really care for such 
pragmas or putting untranslated text directly in source code, and have 
never dealt with it.

8. You may not use a single char or an array of chars to represent 
UTF-16.  It may only represent UTF-8.  If you wish to use UTF-16, you 
must use wchars.

1 (the second #1): but for the code point 0, as encoded in UTF-8, they 
are the same - do you not agree?  A 0 is a zero is a zero.  It doesn't 
matter what he means.
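
A short sketch of that point (Python as an inspection tool; the variable 
names are mine): code point 0, and every code point up to 127, encodes in 
UTF-8 as the single byte with the same value, and the correspondence 
stops at 128.

```python
zero = chr(0).encode("utf-8")

# For 0..127 the UTF-8 encoding is the single byte with the same value.
ascii_identity = all(chr(i).encode("utf-8") == bytes([i]) for i in range(128))

# The first code point past that range already needs two bytes.
first_non_ascii = chr(128).encode("utf-8")

print(zero, ascii_identity, first_non_ascii)
```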

2 (the second): rules about ASCII do not apply to char.  Just as rules 
in Portugal do not dissuade me here in Los Angeles.

3 (the second): I have led the development of multilingual software 
which was used by quite a large number of people.  I also helped 
coordinate, and later interface with the assigned coordinator of, 
translation.  This software was translated into Thai, Chinese 
(simplified and traditional), Russian, Italian, Spanish, Japanese, 
Catalan, and several other languages.  More than twenty anyway.

At first I suggested that everyone use their own encoding, and handled 
that (sometimes painfully) in the code.  I would sometimes get comments 
from the translators, who would have preferred Unicode instead.  This 
software now uses UTF-8 and remains translated into these languages.

So, while I have not been to Russia (although I have worked with 
numerous Russian developers, consumers, and translators) I would tend to 
disagree with your assertion.  Also I do not like helmets.

Obviously, I mean nothing to be taken personally as well; we are only 
talking about UTF-8, Unicode, its usage in D, and being pedantic ;). 
And helmets, we touched that subject too.  But not about each other, really.

Thanks,
-[Unknown]


> "Unknown W. Brackets" <unknown at simplemachines.org> wrote in message 
> news:eagn4d$1q1t$1 at digitaldaemon.com...
>> Andrew,
>>
>> I think it will make a lot more sense if you keep these things in mind... 
>> (I'm sure you already know all of them, I'm just listing them out since 
>> they're crucial and must be thought of together):
>>
>> 1. char, wchar, and dchar are separate types.
> 
> No objections with this.
> 
>> 2. char contains UTF-8 bytes.  It may not contain UTF-16, UCS-2, KOI-8R, 
>> or any other encoding.  It must contain UTF-8.
> 
> Sorry but plural form "char contains UTF-8 bytes" is wrong.
> 
> What you think char means:
> 1) char is an octet (byte) - member of utf-8 sequence -or-
> 2) char is code point of some character in some character table.
> 
> ?
> 
> Probably I am treating English too literally but
> char(acter) is not an UTF-8 byte.  And never was.
> 
> char is an index of some glyph in some encoding table.
> This is common definition used everywhere.
> 
>> 3. wchar contains UTF-16.  It is similar to char in every other way (may 
>> not contain any other encoding than UTF-16, not even UCS-2.)
> 
> The same problem as in #2.
> 
> What is wchar (uint16) for you:
> 1) wchar as is an index of a Unicode scalar value in Basic Multilingual 
> Plane (BMP)
> -or-
> 2) is a uint16 value - member of UTF-16 sequence.
> 
> ?
> 
>> 4. dchar contains UTF-32 code points.  It may not contain any other sort 
>> of encoding, again.
> 
> Oh.....
> 
> UTF-32 (as any other utfs) is a transformation format -
> group name of two different encodings UTF-32BE and UTF-32LE.
> 
> UTF-32 code point is a non-sense.
> 
> UTF-32 defines of how to encode Unicode code point  in
> again sequence of four bytes - octets.
> 
> I would define this thing as
> 
> dchar ( better name is uchar ) is type for representing
> full set of Unicode Code Points (21bit value).
> 
> Pleas note: "transformation format" (UTF) is not by
> any means a "manipulation format".
> 
> Representation of text in memory suitable for
> manipulation (e.g. text processing) is different as rule.
> 
> You cannot use utf-8 encoded russian text for
> analysis. No way.
> 
>> 5. For other encodings, such as ISO-8859-1 or KOI-8R, you should use 
>> ubyte/byte or some other method.  It is not valid to use char.
> 
> Vice versa. For utf-8 encoded strings you should use byte[]
> and for strings using single byte encodings you should use char.
> 
>> 6. The FF byte (8-bit octet sequence) may never appear in any valid UTF-8 
>> string.  Since char can only contain UTF-8 strings, it represents invalid 
>> data if it contains such an 8-bit octet.
> 
> No objections with that, for UTF-8 octet sequences 0xFF is invalid
> value of octet in the sequence. But please note: in the sequence of octets.
> 
>> 7. Code points are the characters in Unicode; they are "compressed", so to 
>> speak, in encodings such as UTF-8 and UTF-16.  USC-2 and USC-4 (UTF-32) 
>> contain full code points.
> 
> Sorry, but USC-4 *is not* UTF-32
> http://www.unicode.org/reports/tr19/tr19-9.html
> 
> I will ask again:
> 
> What:
> char c = 'a';
> means for you?
> 
> And following in C/C++:
> 
> #pragma(encoding,"KOI-8R")
> 
> char c = '?';
> 
> ?
> 
> 
>> 8. If you were to examine the bytes in a wchar string, it may be possible 
>> that the 8-bit octet sequence "FF" might show up.  Nonetheless, since char 
>> cannot be used for UTF-16, this doesn't matter.
> 
> Not clear what you mean here. Could you clarify? Especially last statement.
> 
>> 9. For the above reason, wchar (UTF-16) uses FFFF.  This character is 
>> similar to FF for UTF-8.
>>
>> Given the above, I think I might answer your questions:
>>
>> 1. UTF-8 character here could mean an 8-bit octet of code point.  In this 
>> case, they are both the same and represent a perfectly valid character in 
>> a string.
> 
> Sorry I am not buying following:
> "UTF-8 character" and "8-bit octet of code point"
> 
>> 2. ASCII does not matter; char is not ASCII.  It happens that ASCII bytes 
>> 0 to 127 correspond to the same code points in Unicode, and the same 
>> characters in UTF-8.
> 
> "ASCII does not matter"... for whom?
> 
>> 3. It does not matter; KOI-8R encoded strings should not be placed in char 
>> arrays.  You should use UTF-8 or another encoding for your Russian text.
> 
> "You should use UTF-8 or another encoding for your Russian text."
> 
> Thanks.
> 
> Advice from my side:
> Let me know when you will visit Russia.
> I will ask representatives of russian developer community and web authors
> to meet you.
> 
> Advice per se: You should wear a helmet.
> 
>> 4. If you wish to use KOI-8R (or any other encoding not based on Unicode) 
>> you should not be using char arrays, which are meant for Unicode-related 
>> encodings only.
> 
> The same advice as above.
> 
>> Obviously this is by far different from C, but that's the good thing about 
>> D in many ways ;).
> 
> In Israel they have an old saying:
> "Not a human for Saturday but Saturday for human".
> 
> I do have practical experience in writnig text processing software in
> encodings other than "US-ASCII" and have heard your advices about
> UTF-8 usage with interest.
> 
> Please don't take all of this personal - no intention to harm anybody.
> Honestly and with smile :)
> 
> Andrew.
> 
> 


