To Walter, about char[] initialization by FF

Sun Jul 30 07:21:26 PDT 2006

Derek wrote:
> On Sat, 29 Jul 2006 13:27:14 -0700, Andrew Fedoniouk wrote:
> 
> 
>> ... but this is far from concept of null codepoint in character encodings.
> 
> Andrew and others,
> I've read through these posts a few times now, trying to understand the
> various points of view being presented. I keep getting the feeling that
> some people are deliberately trying *not* to understand what other people
> are saying. This is a sad situation.
> 
> Andrew seems to be stating ...
> (a) char[] arrays should be allowed to hold encodings other than UTF-8, and
> thus initializing them with hex-FF byte values is not useful.
> (b) UTF-8 encoding is not an efficient encoding for text analysis.
> (c) UTF encodings are not optimized for data transmission (they contain
> redundant data in many contexts).
> (d) The D type called 'char' may not have been the best name to use if it
> is meant to be used to contain only UTF-8 octets.
> 
> I, and many others including Walter, would probably agree to (b), (c) and
> (d). However, considering (b) and (c), UTF has benefits that outweigh these
> issues and there are ways to compensate for these too. Point (d) is a
> casualty of history and to change the language now to rename 'char' to
> anything else would be counterproductive. But feel free to implement
> your own flavour of D.<g>
> 
> Back to point (a)... The fact is, char[] is designed to hold UTF-8
> encodings so don't try to force anything else into such arrays. If you wish
> to use some other encodings, then use a more appropriate data structure for
> it. For example, to hold 'KOI-8' encodings of Russian text, I would
> recommend using ubyte[] instead. To transform char[] to any other encoding
> you will have to provide the functions to do that, as I don't think it is
> Walter's or D's responsibility to do it. The point of initializing UTF-8
> strings with illegal values is to help detect coding or logical mistakes.
> And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode
> codepoint *is* illegal. If you must store an octet of hex-FF then use
> ubyte[] arrays to do it.

Thank you for the clear summary.

Apart from the obvious (d), I think there are two reasons this char 
confusion comes up now and then.

1. The documentation may not be clear enough that char is really only 
meant to represent a UTF-8 code unit (or an ASCII character) and that 
char[] is a UTF-8 encoded string. This needs to be stressed more. People 
coming from C will automatically assume that the D char is equivalent to 
the C char. The documentation should also mention that dchar is the only 
type that can represent any Unicode character, while a char is a whole 
character only in ASCII.
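
To make the distinction concrete, here is a minimal sketch. It assumes a 
current D2 compiler and Phobos: std.utf.validate and the string/dstring 
aliases are taken from today's library, not from the 2006 one, so treat 
the details as illustrative.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

void main()
{
    // A char is a UTF-8 code unit; its default initializer is the
    // byte 0xFF, which never appears in well-formed UTF-8.
    assert(char.init == 0xFF);

    // A freshly allocated char[] is therefore not valid UTF-8 until
    // it has been filled in, and validation catches the mistake.
    char[] uninit = new char[4];
    bool rejected = false;
    try
    {
        validate(uninit);
    }
    catch (UTFException e)
    {
        rejected = true;
    }
    assert(rejected);

    // U+044F (CYRILLIC SMALL LETTER YA) takes two char code units
    // but only a single dchar.
    string s = "\u044F";
    assert(s.length == 2);
    dstring d = "\u044F"d;
    assert(d.length == 1);

    writeln("ok");
}
```

The same character that fits in one dchar needs two chars, which is why 
a char is only a whole character for ASCII.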

The C to D type conversion table doesn't help either:
http://www.digitalmars.com/d/ctod.html
It should say something like:
char => char (UTF-8 and ASCII strings) or ubyte (other byte-based encodings)

2. All string functions in Phobos work only on char[] (and in some cases 
wchar[] and dchar[]), which leaves very few tools for working with other 
string encodings. This is easily remedied by a templated string library, 
such as the one I proposed earlier.
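
As a rough illustration of what "templated" buys here (this is not the 
earlier proposal itself; findSub is a hypothetical name and the code 
assumes a current D2 compiler), a single function template can search 
char[] text and ubyte[] data in some other byte-based encoding alike:

```d
import std.stdio : writeln;

// Hypothetical helper, not part of Phobos: returns the index of the
// first occurrence of needle in haystack, or -1 if there is none.
// T can be char, wchar, dchar, or ubyte for non-UTF encodings.
// The comparison is per code unit, not per character.
ptrdiff_t findSub(T)(const(T)[] haystack, const(T)[] needle)
{
    if (needle.length > haystack.length)
        return -1;
    foreach (i; 0 .. haystack.length - needle.length + 1)
    {
        if (haystack[i .. i + needle.length] == needle)
            return cast(ptrdiff_t) i;
    }
    return -1;
}

void main()
{
    // UTF-8 text.
    assert(findSub("hello world", "world") == 6);
    assert(findSub("hello world", "xyz") == -1);

    // KOI8-R bytes for the Russian word "Privet", stored in ubyte[].
    ubyte[] koi = [0xF0, 0xD2, 0xC9, 0xD7, 0xC5, 0xD4];
    ubyte[] part = [0xD7, 0xC5];
    assert(findSub(koi, part) == 3);

    writeln("ok");
}
```

In KOI8-R the byte 0xFF is itself a letter (the capital hard sign), so 
keeping such text in ubyte[] rather than char[] also avoids any clash 
with the 0xFF initializer.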

/Oskar