To Walter, about char[] initialization by FF

Sun Jul 30 01:51:31 PDT 2006

On Sat, 29 Jul 2006 13:27:14 -0700, Andrew Fedoniouk wrote:

> ... but this is far from concept of null codepoint in character encodings.

Andrew and others,
I've read through these posts a few times now, trying to understand the
various points of view being presented. I keep getting the feeling that
some people are deliberately trying *not* to understand what other people
are saying. This is a sad situation.

Andrew seems to be stating ...
(a) char[] arrays should be allowed to hold encodings other than UTF-8, and
thus initializing them with hex-FF byte values is not useful.
(b) UTF-8 encoding is not an efficient encoding for text analysis.
(c) UTF encodings are not optimized for data transmission (they contain
redundant data in many contexts).
(d) The D type called 'char' may not have been the best name to use if it
is meant to be used to contain only UTF-8 octets.

I, and many others including Walter, would probably agree with (b), (c)
and (d). However, regarding (b) and (c), UTF has benefits that outweigh
these issues, and there are ways to compensate for them too. Point (d) is
a casualty of history, and changing the language now to rename 'char' to
anything else would be counterproductive. But feel free to implement
your own flavour of D.<g>

Back to point (a)... The fact is, char[] is designed to hold UTF-8
encodings so don't try to force anything else into such arrays. If you wish
to use some other encodings, then use a more appropriate data structure for
it. For example, to hold 'KOI-8' encodings of Russian text, I would
recommend using ubyte[] instead. To transform char[] to any other encoding
you will have to provide the functions to do that, as I don't think it is
Walter's or D's responsibility to do it. The point of initializing UTF-8
strings with illegal values is to help detect coding or logical mistakes.
And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode
codepoint *is* illegal. If you must store an octet of hex-FF then use
ubyte[] arrays to do it.
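
To make the above concrete, here is a rough sketch of the behaviour being
discussed. It assumes a present-day D2 compiler and Phobos, where
std.utf.validate throws UTFException on malformed input (the 2006 library
spelled some of these names differently). The helper isValidUtf8 is my own
hypothetical name, not something Phobos provides.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if 's' is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);          // throws UTFException on malformed UTF-8
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    // A default-initialized char holds hex-FF, which is never valid UTF-8.
    char c;
    assert(c == 0xFF);

    // So a freshly allocated char[] is detectably "not yet text".
    char[] fresh = new char[4];
    foreach (char unit; fresh)
        assert(unit == 0xFF);
    assert(!isValidUtf8(fresh));

    // Once real text has been stored, validation passes.
    fresh[] = 'a';
    assert(isValidUtf8(fresh));

    // ubyte[] carries no encoding assumption: it defaults to zero,
    // and hex-FF is just another octet (e.g. in KOI-8 data).
    ubyte[] raw = new ubyte[4];
    assert(raw[0] == 0);
    raw[0] = 0xFF;
    assert(raw[0] == 0xFF);
}
```

If saved as, say, ff.d, this should run with "dmd -unittest -main -run ff.d".
The point is only that forgetting to fill a char[] shows up as invalid
UTF-8 rather than as plausible-looking text, while ubyte[] places no such
restriction on its contents.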

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"