To Walter, about char[] initialization by FF

Sun Jul 30 08:55:45 PDT 2006

Derek wrote:
> On Sat, 29 Jul 2006 13:27:14 -0700, Andrew Fedoniouk wrote:
> 
> 
>> ... but this is far from concept of null codepoint in character encodings.
> 
> Andrew and others,
> I've read through these posts a few times now, trying to understand the
> various points of view being presented. I keep getting the feeling that
> some people are deliberately trying *not* to understand what other people
> are saying. This is a sad situation.
> 
> Andrew seems to be stating ...
> (a) char[] arrays should be allowed to hold encodings other than UTF-8, and
> thus initializing them with hex-FF byte values is not useful.
> (b) UTF-8 encoding is not an efficient encoding for text analysis.
> (c) UTF encodings are not optimized for data transmission (they contain
> redundant data in many contexts).
> (d) The D type called 'char' may not have been the best name to use if it
> is meant to be used to contain only UTF-8 octets.
> 
> I, and many others including Walter, would probably agree to (b), (c) and
> (d). However, considering (b) and (c), UTF has benefits that outweigh these
> issues and there are ways to compensate for these too. Point (d) is a
> casualty of history, and changing the language now to rename 'char' to
> anything else would be counterproductive. But feel free to implement
> your own flavour of D.<g>
> 
> Back to point (a)... The fact is, char[] is designed to hold UTF-8
> encodings so don't try to force anything else into such arrays. If you wish
> to use some other encodings, then use a more appropriate data structure for
> it. For example, to hold 'KOI-8' encodings of Russian text, I would
> recommend using ubyte[] instead. To transform char[] to any other encoding
> you will have to provide the functions to do that, as I don't think it is
> Walter's or D's responsibility to do it. The point of initializing UTF-8
> strings with illegal values is to help detect coding or logical mistakes.
> And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode
> codepoint *is* illegal. If you must store an octet of hex-FF then use
> ubyte[] arrays to do it.
> 

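The FF default Derek describes is easy to see in practice. Below is a minimal sketch, assuming a current D2 compiler and Phobos' std.utf.validate; the helper name isValidUtf8 is mine, purely for illustration. It shows that a char with no initializer holds hex-FF, and that a buffer left in that state is rejected as UTF-8 until real text is stored in it.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if `s` is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);   // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    char c;              // no explicit initializer
    assert(c == 0xFF);   // char.init is hex-FF

    char[4] buf;         // every element starts out as hex-FF
    assert(!isValidUtf8(buf[]));

    buf[] = "abcd";      // once real text is stored, it validates
    assert(isValidUtf8(buf[]));
}
```

Compiling with something like `dmd -unittest -main -run file.d` should run the checks. The point is that a forgotten initialization shows up as an error instead of passing silently as empty or zero-filled text.
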
Good summary. I'd add that, to hold 'KOI-8' encodings, you could create 
a typedef instead of using a bare ubyte:

   typedef ubyte koi8char;

That way the encoding of the ubyte is expressed in the code, as part of 
the type information, and the program can define functions that work 
with it:

   koi8char toUpper(koi8char ch) { ...
   int wordCount(koi8char[] str) { ...
   dchar[] toUTF32(koi8char[] str) { ...
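
To make the idea a bit more concrete, here is a rough sketch of one such function. It rests on a few assumptions: the encoding is KOI8-R specifically, the compiler is a current D2 one (where typedef no longer exists, so a small wrapper struct stands in for it), and the name Koi8Char is just an illustration. In KOI8-R the lowercase Cyrillic letters sit at 0xC0..0xDF and the uppercase ones at 0xE0..0xFF, so upper-casing is a fixed offset.

```d
// Distinct type for KOI8-R code units (hypothetical name; stands in
// for the typedef above, which current D2 compilers no longer accept).
struct Koi8Char
{
    ubyte value;
}

// Upper-case one KOI8-R code unit.
Koi8Char toUpper(Koi8Char ch)
{
    immutable b = ch.value;
    if (b >= 'a' && b <= 'z')        // ASCII letters
        return Koi8Char(cast(ubyte)(b - 0x20));
    if (b >= 0xC0 && b <= 0xDF)      // lowercase Cyrillic block
        return Koi8Char(cast(ubyte)(b + 0x20));
    if (b == 0xA3)                   // ё maps to Ё
        return Koi8Char(0xB3);
    return ch;                       // everything else is unchanged
}

unittest
{
    assert(toUpper(Koi8Char(0xC1)).value == 0xE1); // а -> А
    assert(toUpper(Koi8Char(0x71)).value == 0x51); // q -> Q
    assert(toUpper(Koi8Char(0x31)).value == 0x31); // '1' stays '1'

    // The type keeps KOI8-R data from being passed off as UTF-8.
    static assert(!is(char[] : Koi8Char[]));
}
```

The last check is the real benefit: a Koi8Char[] cannot be handed to a function expecting char[] by accident, so the encoding mismatch is caught at compile time.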

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D