To Walter, about char[] initialization by FF
Derek Parnell
derek at nomail.afraid.org
Tue Aug 1 20:28:30 PDT 2006
On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
> (Hope this long dialog will help all of us to better understand what UNICODE
> is)
>
> "Walter Bright" <newshound at digitalmars.com> wrote in message
> news:eao5st$2r1f$1 at digitaldaemon.com...
>> Andrew Fedoniouk wrote:
>>> Compiler accepts input stream as either BMP codes or full unicode set
>>> encoded using UTF-16.
>>
>> BMP is a subset of UTF-16.
>
> Walter with deepest respect but it is not. Two different things.
>
> UTF-16 is a variable-length encoding - byte stream.
> Unicode BMP is a range of numbers strictly speaking.
Andrew is correct. In UTF-16, characters are variable length: either 2 or 4
bytes long (one or two 16-bit code units). In UTF-8, characters are from 1
to 4 bytes long (this used to be up to 6 but that has changed). UCS-2 is a
fixed-width encoding that covers only the Unicode characters representable
as a single 2-byte integer, i.e. the BMP. Windows NT implemented UCS-2 but
not UTF-16; Windows 2000 and above support UTF-16.
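A rough sketch of those sizes, written in Python rather than D purely for brevity. The helper name `encoded_sizes` is made up for this illustration; only `str.encode` and the standard codec names are real.

```python
# Illustrative only: count the bytes each encoding needs for one code point.
def encoded_sizes(ch: str) -> dict:
    """Return the byte length of a single character in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # the -le variant emits no BOM
        "utf-32": len(ch.encode("utf-32-le")),
    }

# Three BMP characters, then one outside the BMP (needs a surrogate pair).
for ch in ("A", "\u00e9", "\u20ac", "\U0001d11e"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The last character, U+1D11E, lies outside the BMP, so it takes 4 bytes in UTF-16 and cannot be represented in UCS-2 at all.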
...
>> ubyte[] will enable you to use any encoding you wish - and that's what
>> it's there for.
>
> Thus the whole set of Windows API headers (and std.c.string for example)
> seen in D has to be rewritten to accept ubyte[], as char in D is not char
> in C.
> Is this the idea?
Yes. I believe this is how it should now be done. The Phobos library does
not use char, char[], and ubyte[] correctly when interfacing with Windows
and C functions.
My guess is that Walter originally used 'char' to make things easier for C
coders to move over to D, but in doing so, now with UTF support built-in,
has caused more problems than the idea was supposed to solve. The move to
UTF support is good, but the choice of 'char' for the name of a UTF-8
code-unit was, and still is, a big mistake. I would have liked something
more like ...
char ==> An unsigned 8-bit byte. An alias for ubyte.
schar ==> A UTF-8 code unit.
wchar ==> A UTF-16 code unit.
dchar ==> A UTF-32 code unit.
char[] ==> A 'C' string
schar[] ==> A UTF-8 string
wchar[] ==> A UTF-16 string
dchar[] ==> A UTF-32 string
And then have built-in conversions between the UTF encodings. So if people
want to continue to use code from C/C++ that uses code-pages or similar
they can stick with char[].
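To show why code-page data needs to stay as plain bytes, here is a small sketch, again in Python only for illustration. `transcode` and `is_valid_utf8` are hypothetical helper names, and cp1252 is just an assumed example of a Windows code page.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert encoded bytes from one encoding to another via code points."""
    return data.decode(src).encode(dst)

def is_valid_utf8(data: bytes) -> bool:
    """Report whether the byte sequence is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "na\u00efve"

# Conversions between the UTF encodings are lossless in every direction.
utf8 = text.encode("utf-8")
utf16 = transcode(utf8, "utf-8", "utf-16-le")
utf32 = transcode(utf16, "utf-16-le", "utf-32-le")
assert transcode(utf32, "utf-32-le", "utf-8") == utf8

# The same text in a code page is a different byte sequence, and it is
# not valid UTF-8, so it must not be stored in a type that promises UTF-8.
legacy = text.encode("cp1252")
print(is_valid_utf8(utf8), is_valid_utf8(legacy))
```

The code-page bytes fail UTF-8 validation, which is the reason for keeping a separate byte-oriented type for 'C' strings.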
--
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 1:08:51 PM