To Walter, about char[] initialization by FF

Andrew Fedoniouk news at terrainformatica.com
Sat Jul 29 19:51:05 PDT 2006


"Unknown W. Brackets" <unknown at simplemachines.org> wrote in message 
news:eah49h$2pi8$1 at digitaldaemon.com...
> 2. Sorry, an array of char (a single char is one 8-bit octet)
> contains UTF-8 bytes, which are 8-bit octets.
>
> A single character, in UTF-8 encoding, may be 1 byte, 2 bytes, etc. Thus, 
> one char MAY NOT hold every single Unicode code point.  You may need an 
> array of multiple chars (bytes) to hold a single code point.
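
For illustration, a minimal sketch in D (assuming a reasonably modern
compiler) of one code point occupying two chars:

import std.stdio;

void main()
{
    string s = "я";  // CYRILLIC SMALL LETTER YA, U+044F: two UTF-8 octets, 0xD1 0x8F
    writefln("%s code units", s.length);  // prints 2
    // char c = 'я';  // would not compile: a single char cannot hold U+044F
}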
>
> This is not what it means to me; this is what it means.  A char is a 
> single 8-bit octet in a UTF-8 sequence.  They ARE NOT by any means code 
> points.
>
> I'm sorry that I did not specify "array", but I fear you are being 
> pedantic here; I'm sure you knew what I meant.
>
> A char is a single byte in a UTF-8 sequence.  I'm afraid calling
> it an index to a glyph is dangerous, because it is easily misread.
> Again, a single char CANNOT represent code points at or above 128,
> because it is only ONE byte.
>
> A single char therefore may not represent a glyph all of the time, but
> rather a byte in the UTF-8 sequence which may be decoded (along with
> other necessary bytes) into the entirety of the code point.
>
> I hope I'm not being overly pedantic here, but I think your definition is
> either lax or wrong.  But that is only how it reads in English.

"your definition is either lax or wrong"

Which one?

>
> 3. It is #2, as above.  wchars are not UCS-2.  They cannot always
> represent full code points alone.  Arrays of wchars must be used for some
> code points.  As I read your question, #1 is UCS-2 (a fixed-length 16-bit
> encoding) and #2 is UTF-16 (a variable-length encoding with 16-bit code
> units).
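
The same point in a sketch: a code point outside the BMP needs two
wchars (a surrogate pair):

import std.stdio;

void main()
{
    wstring s = "\U0001D11E"w;  // MUSICAL SYMBOL G CLEF, U+1D11E
    writefln("%s UTF-16 code units", s.length);  // prints 2
}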
>
> 4. I was ignoring endianness issues for simplicity.  My point here is that
> a UTF-32 character directly represents a code point.  Sorry again for the
> non-pedantic laxness in my wording.
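
And the UTF-32 counterpart, where one dchar always suffices:

void main()
{
    dchar d = '\U0001D11E';  // any Unicode code point fits in a single dchar
    assert(d == 0x1D11E);
}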

>
> 5. Wrong.  There is no vice versa.  You may use byte or ubyte arrays for 
> your UTF-8 encoded strings and so forth.
>
> In case you didn't realize I was trying to say this:
>
> *char is not for single byte encodings.  char is ONLY for UTF-8.  char may 
> not be used for any other encoding unless you wish to have problems. char 
> is not the same as in other languages, e.g. C.*
>
> If you wish for an 8-bit octet value (such as a character in any encoding,
> single byte or otherwise) you should not be using a char.  That is not a
> correct usage for them; that is what byte and ubyte are for.
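
A sketch of that advice (the Latin-1 bytes here are only an example):

void main()
{
    // "été" in Latin-1 (ISO-8859-1).  The lone 0xE9 octets are not
    // valid UTF-8, so the buffer belongs in ubyte[], not char[].
    ubyte[] latin1 = [0xE9, 0x74, 0xE9];
    // char[] s = cast(char[]) latin1;  // compiles, but lies about the encoding
}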
>
> It is expected that chars in an array will follow a specific sequence; 
> that is, that they will be encoded in UTF-8.  It is not possible to 
> guarantee this if you use other encodings, which is why writefln() will 
> fail in such cases.
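
For instance, std.utf.validate rejects such data with an exception - a
sketch of the same check that, per the above, writefln effectively makes:

import std.utf;

void main()
{
    char[] bogus = ['h', 'i', cast(char) 0xFF];
    validate(bogus);  // throws: 0xFF can never occur in valid UTF-8
}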
>
> 6.  Correct.  And a single char (an 8-bit octet in a sequence of UTF-8
> octets encoded as such) may never be FF, because no 8-bit octet anywhere
> in a valid UTF-8 sequence may be FF.  Remember, char is not a code point.
> It is a single 8-bit octet in a sequence.
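
Indeed - and D uses exactly this fact for the default initializer this
thread is about: the value is deliberately invalid, so uninitialized
character data is detectable:

void main()
{
    char c;               // default-initialized
    assert(c == 0xFF);    // char.init == 0xFF: never valid in UTF-8
    wchar w;
    assert(w == 0xFFFF);  // wchar.init == 0xFFFF: not a valid code point either
}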
>
> 7. My mistake.  I always consider them roughly the same (and for some 
> reason I thought that they had been made the same; but I assume your link 
> is current.)
>
> Your first code sample defines a single UTF-8 character, 'a'.  It is lucky 
> you did not try:
>
> char c = '?';
>
> (hopefully this character gets sent through to you properly; I will be 
> sending this message UTF-8 if my client allows it.)
>
> Because that would have failed.  A char cannot hold such a character,
> which has a code point outside the range 0 - 127.  You would need to use
> an array of chars, or another type.
>
> Your second example means nothing to me.  I don't really care for such 
> pragmas or putting untranslated text directly in source code, and have 
> never dealt with it.
>
> 8. You may not use a single char or an array of chars to represent UTF-16. 
> It may only represent UTF-8.  If you wish to use UTF-16, you must use 
> wchars.
>
> 1 (the second #1): but for the code point 0, as encoded in UTF-8, they are 
> the same - do you not agree?  A 0 is a zero is a zero.  It doesn't matter 
> what he means.
>
> 2 (the second): rules about ASCII do not apply to char.  Just as rules in 
> Portugal do not dissuade me here in Los Angeles.
>
> 3 (the second): I have led the development of multi-lingual software
> which was used by quite a large number of people.  I also helped
> coordinate, and later interfaced with the assigned coordinator of,
> translation.  This software was translated into Thai, Chinese (simplified
> and traditional), Russian, Italian, Spanish, Japanese, Catalan, and
> several other languages.  More than twenty anyway.
>
> At first I suggested that everyone use their own encoding, handling
> that (sometimes painfully) in the code.  I would sometimes get
> comments about using Unicode instead (from the translators who would have 
> preferred this.)  This software now uses UTF-8 and remains translated in 
> these languages.
>
> So, while I have not been to Russia (although I have worked with numerous 
> Russian developers, consumers, and translators) I would tend to disagree 
> with your assertion.  Also I do not like helmets.
>
> Obviously, I mean nothing to be taken personally as well; we are only 
> talking about UTF-8, Unicode, its usage in D, and being pedantic ;). And 
> helmets, we touched that subject too.  But not about each other, really.
>
> Thanks,
> -[Unknown]
>

Ok. Let's make a second round.

Some definitions:

A Unicode Code Point is an integer value (21 bits are used) - an index
into the global Unicode table.
This global encoding table is maintained by the international Unicode
Consortium.
With some exceptions, each code point there has a corresponding
glyph in a "global super font".

There are two types of encodings used for Unicode Code Points:
1) transport encodings - for example the UTFs. Main purpose -
transport/transfer.
2) manipulation encodings - mappings of ranges of Unicode Code Points
onto the ranges 0..0xFF, 0..0xFFFF and 0..0xFFFFFFFF.

Transport encodings are used for transfer and long-term storage of
character data - texts.

Manipulation encodings are used in programming for the efficient
implementation of text processing functions.
As a rule a manipulation encoding maps some fragment (or two) of the
Unicode Code Point set onto the range 0..0xFF or 0..0xFFFF.
The main characteristic of such a mapping: each value in the character
vector (string) is in a 1:1 relationship with the corresponding code
point in the Unicode set.
The main idea of such an encoding: the character at some index in the
string (vector) represents one code point in full, as the sketch below
shows.
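
A minimal sketch in D itself: dchar[] behaves as such a manipulation
encoding, one array element per code point:

import std.conv;

void main()
{
    auto fixed = to!(dchar[])("naïve");  // UTF-32: one element per code point
    assert(fixed.length == 5);           // the UTF-8 form needs 6 chars
    assert(fixed[2] == 'ï');             // index maps 1:1 to a code point
}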

I think the motivation for having manipulation encodings is simple
and everyone understands it.
Think about how you would implement caret positioning in an editbox,
for example.
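
To make the cost concrete, a sketch (caretToOffset is a hypothetical
helper, not a library function): mapping a caret position in code
points to a byte offset in UTF-8 takes a linear scan.

import std.utf;

size_t caretToOffset(const(char)[] text, size_t caret)
{
    size_t offset = 0;
    foreach (_; 0 .. caret)
        offset += stride(text, offset);  // code units of the code point here
    return offset;
}

void main()
{
    assert(caretToOffset("naïve", 3) == 4);  // 'ï' occupies two bytes
}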

So the statement "char[] in D is supposed to hold only UTF-8 encoded text"
immediately leads us to "D is not designed for efficient text processing".

Is this logic clear?

Again - let char be a char in D as it is now. Just don't initialize it
to 0xFF, please. And let us be a bit careful with our UTF-8 expectations -
yes, it is an almost ideal transport encoding, but it is completely useless
for text manipulation purposes - too expensive.

(last message on the subject)

Andrew Fedoniouk.
http://terrainformatica.com




