To Walter, about char[] initialization by FF

Derek Parnell derek at nomail.afraid.org
Tue Aug 1 20:28:30 PDT 2006


On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:

> (Hope this long dialog will help all of us to better understand what UNICODE 
> is)
> 
> "Walter Bright" <newshound at digitalmars.com> wrote in message 
> news:eao5st$2r1f$1 at digitaldaemon.com...
>> Andrew Fedoniouk wrote:
>>> Compiler accepts input stream as either BMP codes or full unicode set
>>> encoded using UTF-16.
>>
>> BMP is a subset of UTF-16.
> 
> Walter with deepest respect but it is not. Two different things.
> 
> UTF-16 is a variable-length encoding - byte stream.
> Unicode BMP is a range of numbers strictly speaking.

Andrew is correct. In UTF-16, each character takes either 2 or 4 bytes (one
or two 16-bit code units). In UTF-8, each character takes from 1 to 4 bytes
(this used to be up to 6 but that has changed). UCS-2 is a fixed-width
encoding in which every character is a single 2-byte integer, so it can
represent only the BMP subset of Unicode. Windows NT implemented UCS-2 but
not UTF-16; Windows 2000 and later support UTF-16.
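
As a rough illustration of those sizes, here is a minimal C sketch (not
from the original thread; the names utf8_len, utf16_len and fits_ucs2 are
made up for this example) that reports how many bytes a given code point
needs in each encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode code point cp in UTF-8, or 0 if cp is not a
   valid Unicode scalar value (a surrogate, or above U+10FFFF). */
size_t utf8_len(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are not characters */
    if (cp <= 0x7F)     return 1;
    if (cp <= 0x7FF)    return 2;
    if (cp <= 0xFFFF)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Bytes needed to encode code point cp in UTF-16, or 0 if invalid. */
size_t utf16_len(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp <= 0xFFFF)   return 2; /* inside the BMP: one 16-bit code unit */
    if (cp <= 0x10FFFF) return 4; /* outside the BMP: a surrogate pair */
    return 0;
}

/* Nonzero if cp can be stored in UCS-2, i.e. it needs only one 16-bit unit. */
int fits_ucs2(uint32_t cp)
{
    return utf16_len(cp) == 2;
}
```

For the euro sign, utf8_len(0x20AC) gives 3 and utf16_len(0x20AC) gives 2.
U+1D11E lies outside the BMP, so it needs 4 bytes in both encodings and
fits_ucs2(0x1D11E) is 0.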

...

>> ubyte[] will enable you to use any encoding you wish - and that's what 
>> it's there for.
> 
> Thus the whole set of Windows API headers (and std.c.string for example)
> seen in D has to be rewritten to accept ubyte[], as char in D is not char
> in C. Is this the idea?

Yes. I believe this is how it now should be done. The Phobos library is not
correctly using char, char[], and ubyte[] when interfacing with Windows and
C functions. 

My guess is that Walter originally used 'char' to make it easier for C
coders to move over to D, but now that UTF support is built in, that choice
has caused more problems than it was supposed to solve. The move to
UTF support is good, but the choice of 'char' for the name of a UTF-8
code-unit was, and still is, a big mistake. I would have liked something
more like ...

  char  ==> An unsigned 8-bit byte. An alias for ubyte.
  schar ==> A UTF-8 code unit.
  wchar ==> A UTF-16 code unit.
  dchar ==> A UTF-32 code unit.

  char[]  ==> A 'C' string
  schar[] ==> A UTF-8 string
  wchar[] ==> A UTF-16 string
  dchar[] ==> A UTF-32 string

And then have built-in conversions between the UTF encodings. So if people
want to continue to use code from C/C++ that uses code-pages or similar
they can stick with char[]. 
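
To make "conversions between the UTF encodings" a little more concrete,
here is a minimal C sketch of one direction, UTF-32 to UTF-8. It is only an
illustration under stated assumptions, not how D or Phobos does it: the
typedef names utf8_unit and utf32_unit are hypothetical stand-ins for the
'schar' and 'dchar' suggested above, and utf32_to_utf8 is an invented name.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  utf8_unit;  /* hypothetical: a UTF-8 code unit ("schar")  */
typedef uint32_t utf32_unit; /* hypothetical: a UTF-32 code unit ("dchar") */

/* Convert n UTF-32 code units from src into UTF-8 in dst (capacity cap).
   Returns the number of bytes written, or (size_t)-1 if a code point is
   invalid or dst is too small. */
size_t utf32_to_utf8(const utf32_unit *src, size_t n,
                     utf8_unit *dst, size_t cap)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        utf32_unit cp = src[i];
        size_t need;

        if (cp >= 0xD800 && cp <= 0xDFFF) return (size_t)-1; /* surrogate */
        if (cp <= 0x7F)          need = 1;
        else if (cp <= 0x7FF)    need = 2;
        else if (cp <= 0xFFFF)   need = 3;
        else if (cp <= 0x10FFFF) need = 4;
        else return (size_t)-1;                /* beyond the Unicode range */

        if (cap - out < need) return (size_t)-1; /* not enough room */

        if (need == 1) {
            dst[out++] = (utf8_unit)cp;
        } else if (need == 2) {
            dst[out++] = (utf8_unit)(0xC0 | (cp >> 6));
            dst[out++] = (utf8_unit)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            dst[out++] = (utf8_unit)(0xE0 | (cp >> 12));
            dst[out++] = (utf8_unit)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (utf8_unit)(0x80 | (cp & 0x3F));
        } else {
            dst[out++] = (utf8_unit)(0xF0 | (cp >> 18));
            dst[out++] = (utf8_unit)(0x80 | ((cp >> 12) & 0x3F));
            dst[out++] = (utf8_unit)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (utf8_unit)(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Keeping the code-unit types distinct from a plain byte type is the point of
the naming scheme: a function like this can only be handed UTF data, while
code-page text from C stays in an ordinary byte array.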



-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 1:08:51 PM


