To Walter, about char[] initialization by FF

Tue Aug 1 21:04:10 PDT 2006

"Derek Parnell" <derek at nomail.afraid.org> wrote in message 
news:13qrud1m5v15d$.ydqvoi8nx4f8.dlg at 40tude.net...
> On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
>
>> (Hope this long dialog will help all of us to better understand what
>> UNICODE is)
>>
>> "Walter Bright" <newshound at digitalmars.com> wrote in message
>> news:eao5st$2r1f$1 at digitaldaemon.com...
>>> Andrew Fedoniouk wrote:
>>>> Compiler accepts input stream as either BMP codes or full unicode set
>>>> encoded using UTF-16.
>>>
>>> BMP is a subset of UTF-16.
>>
>> Walter, with deepest respect, it is not. They are two different things.
>>
>> UTF-16 is a variable-length encoding, i.e. a byte stream.
>> The Unicode BMP is, strictly speaking, a range of numbers (code points).
>
> Andrew is correct. In UTF-16, characters are variable length, either 2 or
> 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to
> be up to 6 but that has changed). UCS-2 is a subset of Unicode characters
> that are all represented by 2-byte integers. Windows NT implemented
> UCS-2 but not UTF-16; Windows 2000 and later support UTF-16.
>
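
To make the byte counts above concrete, here is a small sketch using Python's built-in codecs. The language and the three sample characters are my own picks for illustration, not something from this thread. It reports how many bytes one character takes in each encoding, and shows that a character outside the BMP becomes two 16-bit code units in UTF-16.

```python
# Sample characters are arbitrary picks for illustration only.
SAMPLES = {
    "A": "U+0041, ASCII",
    "\u20ac": "U+20AC, euro sign, inside the BMP",
    "\U0001d11e": "U+1D11E, musical G clef, outside the BMP",
}


def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes, utf32_bytes) for one character."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # "-le" variant so no BOM is prepended
        len(ch.encode("utf-32-le")),
    )


def utf16_code_units(ch):
    """Return the UTF-16 code units of one character as a list of ints."""
    raw = ch.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


for ch, note in SAMPLES.items():
    units = [hex(u) for u in utf16_code_units(ch)]
    print(note, encoded_sizes(ch), units)
```

For a BMP character the single UTF-16 code unit has the same numeric value as the code point, which is probably why the two get confused. Outside the BMP that no longer holds: the code point is split into a surrogate pair.
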
> ...
>
>>> ubyte[] will enable you to use any encoding you wish - and that's what
>>> it's there for.
>>
>> Thus the whole set of Windows API headers (and std.c.string, for example)
>> seen in D has to be rewritten to accept ubyte[], as char in D is not char
>> in C. Is this the idea?
>
> Yes. I believe this is how it now should be done. The Phobos library is
> not correctly using char, char[], and ubyte[] when interfacing with Windows
> and C functions.
>
> My guess is that Walter originally used 'char' to make things easier for C
> coders to move over to D, but in doing so, now with UTF support built-in,
> has caused more problems than the idea was supposed to solve. The move to
> UTF support is good, but the choice of 'char' for the name of a UTF-8
> code-unit was, and still is, a big mistake. I would have liked something
> more like ...
>
>  char  ==> An unsigned 8-bit byte. An alias for ubyte.
>  schar ==> A UTF-8 code unit.
>  wchar ==> A UTF-16 code unit.
>  dchar ==> A UTF-32 code unit.
>
>  char[] ==> A 'C' string
>  schar[] ==> A UTF-8 string
>  wchar[] ==> A UTF-16 string
>  dchar[] ==> A UTF-32 string
>
> And then have built-in conversions between the UTF encodings. So if people
> want to continue to use code from C/C++ that uses code-pages or similar
> they can stick with char[].
>
>
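
A rough sketch of why the split matters, again in Python and with made-up sample data (the helper names `is_valid_utf8` and `to_utf8` and the choice of the Windows-1252 code page are assumptions for illustration). A C string in a code page is just bytes; it is generally not valid UTF-8, so it has to be converted explicitly rather than passed around as if it were a UTF-8 string.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Check whether a byte sequence is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(raw: bytes, codepage: str) -> bytes:
    """Explicitly convert code-page bytes (a 'C string') to UTF-8 bytes."""
    return raw.decode(codepage).encode("utf-8")


# Hypothetical sample: "café" as stored by a Windows-1252 C program.
c_string = b"caf\xe9"

print(is_valid_utf8(c_string))        # the raw bytes are not UTF-8
print(to_utf8(c_string, "cp1252"))    # after an explicit conversion they are
print(is_valid_utf8(b"\xff"))         # 0xFF never occurs in well-formed UTF-8
```

With separate types for "raw C bytes" and "UTF-8 code units", this conversion step would have to be written out at the boundary instead of being skipped by accident.
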

Yes, Derek, this would probably be near the ideal.

Andrew.