To Walter, about char[] initialization by FF

Tue Aug 1 21:22:54 PDT 2006

On Tue, 1 Aug 2006 21:04:10 -0700, Andrew Fedoniouk  
<news at terrainformatica.com> wrote:
> "Derek Parnell" <derek at nomail.afraid.org> wrote in message
> news:13qrud1m5v15d$.ydqvoi8nx4f8.dlg at 40tude.net...
>> On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
>>
>>> (Hope this long dialog will help all of us to better understand what
>>> UNICODE is)
>>>
>>> "Walter Bright" <newshound at digitalmars.com> wrote in message
>>> news:eao5st$2r1f$1 at digitaldaemon.com...
>>>> Andrew Fedoniouk wrote:
>>>>> Compiler accepts input stream as either BMP codes or full unicode set
>>>>> encoded using UTF-16.
>>>>
>>>> BMP is a subset of UTF-16.
>>>
>>> Walter with deepest respect but it is not. Two different things.
>>>
>>> UTF-16 is a variable-length encoding - byte stream.
>>> Unicode BMP is a range of numbers strictly speaking.
>>
>> Andrew is correct. In UTF-16, characters are variable length, from 2 to 4
>> bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to
>> be up to 6 but that has changed). UCS-2 is a subset of Unicode characters
>> that are all represented by 2-byte integers. Windows NT had implemented
>> UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now.
>>
>> ...
>>
>>>> ubyte[] will enable you to use any encoding you wish - and that's what
>>>> it's there for.
>>>
>>> Thus the whole set of Windows API headers (and std.c.string for example)
>>> seen in D has to be rewritten to accept ubyte[]. As char in D is not char
>>> in C.
>>> Is this the idea?
>>
>> Yes. I believe this is how it now should be done. The Phobos library is not
>> correctly using char, char[], and ubyte[] when interfacing with Windows and
>> C functions.
>>
>> My guess is that Walter originally used 'char' to make things easier for C
>> coders to move over to D, but in doing so, now with UTF support built-in,
>> has caused more problems than the idea was supposed to solve. The move to
>> UTF support is good, but the choice of 'char' for the name of a UTF-8
>> code-unit was, and still is, a big mistake. I would have liked something
>> more like ...
>>
>>  char  ==> An unsigned 8-bit byte. An alias for ubyte.
>>  schar ==> A UTF-8 code unit.
>>  wchar ==> A UTF-16 code unit.
>>  dchar ==> A UTF-32 code unit.
>>
>>  char[] ==> A 'C' string
>>  schar[] ==> A UTF-8 string
>>  wchar[] ==> A UTF-16 string
>>  dchar[] ==> A UTF-32 string
>>
>> And then have built-in conversions between the UTF encodings. So if people
>> want to continue to use code from C/C++ that uses code-pages or similar
>> they can stick with char[].
>>
>>
>
> Yes, Derek, this will probably be near the ideal.

Yet, I don't find it at all difficult to think of them like so:

   ubyte ==> An unsigned 8-bit byte.
   char  ==> A UTF-8 code unit.
   wchar ==> A UTF-16 code unit.
   dchar ==> A UTF-32 code unit.

   ubyte[] ==> A 'C' string
   char[]  ==> A UTF-8 string
   wchar[] ==> A UTF-16 string
   dchar[] ==> A UTF-32 string

If you want to program in D you _will_ have to readjust your thinking in
some areas; this is one of them.
All you have to realise is that 'char' in D is not the same as 'char' in C.

In quick-and-dirty, ASCII-only applications I can adjust my thinking
further:

   char   ==> An ASCII character
   char[] ==> An ASCII string

I do however agree that C functions used in D should be declared like:
   int strlen(ubyte* s);

and not like (as they currently are):
   int strlen(char* s);

The problem with the ubyte* declaration is that the code:
   char[] s = "test";
   strlen(s);

would produce a compile error, and require a cast or a conversion function  
(toMBSz perhaps, which in many cases will not need to do anything).

Of course the purists would say "That's perfectly correct, strlen cannot
tell you the length of a UTF-8 string, only its byte count", but at the
same time it would be nice (for quick-and-dirty, ASCII-only programs) if it
worked.
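
To make the byte-count point concrete, here is a minimal sketch in C (not D,
since strlen is the C library function under discussion). The helper name
utf8_code_points is hypothetical and is not part of any library; it assumes
its input is valid, NUL-terminated UTF-8 and simply skips continuation bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (an assumption for illustration, not a library call):
 * counts UTF-8 code points by counting every byte that is NOT a
 * continuation byte (continuation bytes have the bit pattern 10xxxxxx).
 * Assumes the input is valid, NUL-terminated UTF-8. */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);

    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the pure ASCII string "test" both strlen and this helper report 4. For
"na\xC3\xAFve" (the UTF-8 encoding of "naïve") strlen reports 6 bytes while
the helper reports 5 code points, which is the gap the purists are pointing
at.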

Is it possible to declare them like this?
   int strlen(void* s);

and for char[] to be implicitly 'paintable' as void*, just as char[] is
already implicitly 'paintable' as void[]?

It seems like it would nicely solve the problem of people seeing:
   int strlen(char* s);

and thinking D's char is the same as C's char, without introducing a
painful need for cast or conversion in simple ASCII only situations.
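
As a rough analogy only, this is how the same idea looks in C, where any
object pointer already converts to void* without a cast. The wrapper name
byte_length is hypothetical, and whether D would allow the equivalent
implicit conversion from char[] is exactly the open question above; this
sketch does not answer it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper (an assumption for illustration): the parameter is
 * const void*, so callers holding char* or unsigned char* data can pass it
 * without a cast. The result is a byte count, exactly as with strlen. */
size_t byte_length(const void *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}
```

Both a `const char text[] = "test"` and a `const unsigned char raw[]` holding
the same bytes can be passed to byte_length directly. The cost of the design
is that the signature no longer says anything about what the bytes mean, so
the compiler cannot catch a caller passing something that is not a
NUL-terminated string at all.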

Regan