To Walter, about char[] initialization by FF

Andrew Fedoniouk news at terrainformatica.com
Tue Aug 1 21:46:58 PDT 2006


"Regan Heath" <regan at netwin.co.nz> wrote in message 
news:optdm2gghi23k2f5 at nrage...
> On Tue, 1 Aug 2006 21:04:10 -0700, Andrew Fedoniouk 
> <news at terrainformatica.com> wrote:
>> "Derek Parnell" <derek at nomail.afraid.org> wrote in message
>> news:13qrud1m5v15d$.ydqvoi8nx4f8.dlg at 40tude.net...
>>> On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
>>>
>>>> (Hope this long dialog will help all of us to better understand what
>>>> UNICODE is)
>>>>
>>>> "Walter Bright" <newshound at digitalmars.com> wrote in message
>>>> news:eao5st$2r1f$1 at digitaldaemon.com...
>>>>> Andrew Fedoniouk wrote:
>>>>>> Compiler accepts input stream as either BMP codes or full Unicode set
>>>>>> encoded using UTF-16.
>>>>>
>>>>> BMP is a subset of UTF-16.
>>>>
>>>> Walter, with deepest respect, it is not. They are two different things.
>>>>
>>>> UTF-16 is a variable-length encoding - a byte stream.
>>>> The Unicode BMP, strictly speaking, is a range of code points.
>>>
>>> Andrew is correct. In UTF-16, characters are variable length, either 2
>>> or 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this
>>> used to be up to 6, but that has changed). UCS-2 is the subset of
>>> Unicode characters that can each be represented by a single 2-byte
>>> integer. Windows NT implemented UCS-2 rather than UTF-16; Windows 2000
>>> and above support UTF-16.
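>>>
>>> For instance, a minimal D sketch (a string literal converts to any of
>>> the three UTF array types; the character below lies outside the BMP):
>>>
>>>   char[]  u8  = "\U0001D11E"; // MUSICAL SYMBOL G CLEF, U+1D11E
>>>   wchar[] u16 = "\U0001D11E";
>>>   dchar[] u32 = "\U0001D11E";
>>>   assert(u8.length  == 4);    // four UTF-8 code units (bytes)
>>>   assert(u16.length == 2);    // one surrogate pair
>>>   assert(u32.length == 1);    // one code point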
>>>
>>> ...
>>>
>>>>> ubyte[] will enable you to use any encoding you wish - and that's what
>>>>> it's there for.
>>>>
>>>> Thus the whole set of Windows API headers (and std.c.string, for
>>>> example) seen in D has to be rewritten to accept ubyte[], as char in D
>>>> is not char in C. Is this the idea?
>>>
>>> Yes. I believe this is how it should now be done. The Phobos library is
>>> not using char, char[], and ubyte[] correctly when interfacing with
>>> Windows and C functions.
>>>
>>> My guess is that Walter originally used 'char' to make things easier for
>>> C coders to move over to D, but in doing so, now with UTF support
>>> built-in, has caused more problems than the idea was supposed to solve.
>>> The move to UTF support is good, but the choice of 'char' as the name of
>>> a UTF-8 code unit was, and still is, a big mistake. I would have liked
>>> something more like ...
>>>
>>>  char  ==> An unsigned 8-bit byte. An alias for ubyte.
>>>  schar ==> A UTF-8 code unit.
>>>  wchar ==> A UTF-16 code unit.
>>>  dchar ==> A UTF-32 code unit.
>>>
>>>  char[] ==> A 'C' string
>>>  schar[] ==> A UTF-8 string
>>>  wchar[] ==> A UTF-16 string
>>>  dchar[] ==> A UTF-32 string
>>>
>>> And then have built-in conversions between the UTF encodings. So if 
>>> people
>>> want to continue to use code from C/C++ that uses code-pages or similar
>>> they can stick with char[].
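>>>
>>> (For what it's worth, Phobos' std.utf already provides explicit
>>> conversions between the encodings - a small sketch:
>>>
>>>   import std.utf;
>>>
>>>   char[]  u8  = "hello";
>>>   wchar[] u16 = toUTF16(u8); // UTF-8  -> UTF-16
>>>   dchar[] u32 = toUTF32(u8); // UTF-8  -> UTF-32
>>>   char[]  rt  = toUTF8(u16); // UTF-16 -> UTF-8
>>>
>>> so 'built-in' here would mean making such conversions implicit.)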
>>>
>>>
>>
>> Yes, Derek, this would probably be close to the ideal.
>
> Yet, I don't find it at all difficult to think of them like so:
>
>   ubyte ==> An unsigned 8-bit byte.
>   char  ==> A UTF-8 code unit.
>   wchar ==> A UTF-16 code unit.
>   dchar ==> A UTF-32 code unit.
>
>   ubyte[] ==> A 'C' string
>   char[]  ==> A UTF-8 string
>   wchar[] ==> A UTF-16 string
>   dchar[] ==> A UTF-32 string
>
> If you want to program in D you _will_ have to readjust your thinking in 
> some areas, and this is one of them. All you have to realise is that 
> 'char' in D is not the same as 'char' in C.
>
> In quick and dirty ASCII-only applications I can adjust my thinking 
> further:
>
>   char   ==> An ASCII character
>   char[] ==> An ASCII string
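>
> That adjustment is safe because ASCII is a strict subset of UTF-8, e.g.:
>
>   char[] s = "hi there";     // pure ASCII, hence also valid UTF-8
>   assert(s.length == 8);     // bytes == characters, for ASCII only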
>
> I do however agree that C functions used in D should be declared like:
>   int strlen(ubyte* s);
>
> and not like (as they currently are):
>   int strlen(char* s);
>
> The problem with this is that the code:
>   char[] s = "test";
>   strlen(s)
>
> would produce a compile error, and require a cast or a conversion function 
> (toMBSz perhaps, which in many cases will not need to do anything).
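>
> A minimal sketch with the hypothetical ubyte* declaration above (D string 
> literals are guaranteed a trailing '\0', which is why passing them 
> straight to C works at all):
>
>   extern (C) int strlen(ubyte* s);
>
>   char[] s = "test";
>   // int n = strlen(s.ptr);           // error: char* is not ubyte*
>   int n = strlen(cast(ubyte*)s.ptr);  // explicit cast required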
>
> Of course the purists would say "That's perfectly correct, strlen cannot 
> tell you the length of a UTF-8 string, only its byte count", but at the 
> same time it would be nice (for quick and dirty ASCII-only programs) if it 
> worked.
>
> Is it possible to declare them like this?
>   int strlen(void* s);
>
> and for char[] to be implicitly 'paintable' as void*, as char[] is already 
> implicitly 'paintable' as void[]?
>
> It seems like it would nicely solve the problem of people seeing:
>   int strlen(char* s);
>
> and thinking D's char is the same as C's char, without introducing a 
> painful need for casts or conversions in simple ASCII-only situations.
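>
> A sketch of that idea (hypothetical declaration; .ptr is used here, the 
> open question being whether the array itself should convert):
>
>   extern (C) int strlen(void* s);
>
>   char[] s = "test";
>   int n = strlen(s.ptr);    // any pointer converts implicitly to void*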
>
> Regan

Another option would be to change char.init to 0 and forget about the
problem, or just leave it as it is now. A good string implementation will
carry an encoding field in the string instance if needed.
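
For reference, a minimal sketch of the current behaviour the subject line
refers to - char.init is 0xFF, an invalid UTF-8 code unit, chosen so that
uninitialized character data is caught instead of silently passing as text:

  char c;                 // default-initialized
  assert(c == 0xFF);      // char.init == 0xFF
  char[4] buf;            // every element starts as 0xFF, not 0
  assert(buf[0] == 0xFF);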

Andrew.