To Walter, about char[] initialization by FF

Sun Jul 30 02:33:41 PDT 2006

Derek wrote:
> Andrew seems to be stating ...
> (a) char[] arrays should be allowed to hold encodings other than UTF-8, and
> thus initializing them with hex-FF byte values is not useful.
> (b) UTF-8 encoding is not an efficient encoding for text analysis.
> (c) UTF encodings are not optimized for data transmission (they contain
> redundant data in many contexts).
> (d) The D type called 'char' may not have been the best name to use if it
> is meant to be used to contain only UTF-8 octets.
> 
> I, and many others including Walter, would probably agree to (b), (c) and
> (d). However, considering (b) and (c), UTF has benefits that outweigh these
> issues and there are ways to compensate for these too. Point (d) is a
> casualty of history and to change the language now to rename 'char' to
> anything else would be counter productive now. But feel free to implement
> your own flavour of D.<g>
> 
> Back to point (a)... The fact is, char[] is designed to hold UTF-8
> encodings so don't try to force anything else into such arrays. If you wish
> to use some other encodings, then use a more appropriate data structure for
> it. For example, to hold 'KOI-8' encodings of Russian text, I would
> recommend using ubyte[] instead. To transform char[] to any other encoding
> you will have to provide the functions to do that, as I don't think it is
> Walter's or D's responsibilty to do it. The point of initializing UTF-8
> strings with illegal values is to help detect coding or logical mistakes.
> And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode
> codepoint *is* illegal. If you must store an octet of hex-FF then use
> ubyte[] arrays to do it.

Thank you for the insightful summary of the situation.

I suspect, though, that (c) might be moot since it is my understanding 
that most actual data transmission equipment automatically compresses 
the data stream, and so the redundancy of the UTF-8 is minimized. Text 
itself tends to be highly compressible on top of that.

Furthermore, because of the rate of expansion and declining costs of 
bandwidth, the cost of extra bytes is declining at the same time that 
the cost of the inflexibility of code pages is increasing.