To Walter, about char[] initialization by FF
Andrew Fedoniouk
news at terrainformatica.com
Tue Aug 1 21:46:58 PDT 2006
"Regan Heath" <regan at netwin.co.nz> wrote in message
news:optdm2gghi23k2f5 at nrage...
> On Tue, 1 Aug 2006 21:04:10 -0700, Andrew Fedoniouk
> <news at terrainformatica.com> wrote:
>> "Derek Parnell" <derek at nomail.afraid.org> wrote in message
>> news:13qrud1m5v15d$.ydqvoi8nx4f8.dlg at 40tude.net...
>>> On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
>>>
>>>> (Hope this long dialog will help all of us to better understand what
>>>> UNICODE
>>>> is)
>>>>
>>>> "Walter Bright" <newshound at digitalmars.com> wrote in message
>>>> news:eao5st$2r1f$1 at digitaldaemon.com...
>>>>> Andrew Fedoniouk wrote:
>>>>>> Compiler accepts input stream as either BMP codes or full unicode set
>>>>> encoded using UTF-16.
>>>>>
>>>>> BMP is a subset of UTF-16.
>>>>
>>>> Walter with deepest respect but it is not. Two different things.
>>>>
>>>> UTF-16 is a variable-length encoding - a byte stream.
>>>> Unicode BMP is, strictly speaking, a range of code points.
>>>
>>> Andrew is correct. In UTF-16, characters are variable length, from 2 to 4
>>> bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to
>>> be up to 6, but that has changed). UCS-2 is the subset of Unicode
>>> characters that are all represented by 2-byte integers. Windows NT
>>> implemented UCS-2 but not UTF-16; Windows 2000 and above support UTF-16.
>>>
>>> ...
>>>
>>>>> ubyte[] will enable you to use any encoding you wish - and that's what
>>>>> it's there for.
>>>>
>>>> Thus the whole set of Windows API headers (and std.c.string, for
>>>> example) seen in D has to be rewritten to accept ubyte[], as char in D
>>>> is not char in C. Is this the idea?
>>>
>>> Yes. I believe this is how it now should be done. The Phobos library is
>>> not correctly using char, char[], and ubyte[] when interfacing with
>>> Windows and C functions.
>>>
>>> My guess is that Walter originally used 'char' to make things easier for
>>> C coders to move over to D, but in doing so, now with UTF support
>>> built-in, has caused more problems than the idea was supposed to solve.
>>> The move to UTF support is good, but the choice of 'char' for the name of
>>> a UTF-8 code unit was, and still is, a big mistake. I would have liked
>>> something more like ...
>>>
>>> char ==> An unsigned 8-bit byte. An alias for ubyte.
>>> schar ==> A UTF-8 code unit.
>>> wchar ==> A UTF-16 code unit.
>>> dchar ==> A UTF-32 code unit.
>>>
>>> char[] ==> A 'C' string
>>> schar[] ==> A UTF-8 string
>>> wchar[] ==> A UTF-16 string
>>> dchar[] ==> A UTF-32 string
>>>
>>> And then have built-in conversions between the UTF encodings. So if
>>> people
>>> want to continue to use code from C/C++ that uses code-pages or similar
>>> they can stick with char[].
>>>
>>>
>>
>> Yes, Derek, this will be probably near the ideal.
>
> Yet, I don't find it at all difficult to think of them like so:
>
> ubyte ==> An unsigned 8-bit byte.
> char ==> A UTF-8 code unit.
> wchar ==> A UTF-16 code unit.
> dchar ==> A UTF-32 code unit.
>
> ubyte[] ==> A 'C' string
> char[] ==> A UTF-8 string
> wchar[] ==> A UTF-16 string
> dchar[] ==> A UTF-32 string
>
> If you want to program in D you _will_ have to readjust your thinking in
> some areas, and this is one of them. All you have to realise is that
> 'char' in D is not the same as 'char' in C.
>
> In quick-and-dirty, ASCII-only applications I can adjust my thinking
> further:
>
> char ==> An ASCII character
> char[] ==> An ASCII string
>
> I do however agree that C functions used in D should be declared like:
> int strlen(ubyte* s);
>
> and not like (as they currently are):
> int strlen(char* s);
>
> The problem with this is that the code:
> char[] s = "test";
> strlen(s)
>
> would produce a compile error and require a cast or a conversion function
> (toMBSz perhaps, which in many cases would not need to do anything).
>
> Of course the purists would say "That's perfectly correct: strlen cannot
> tell you the length of a UTF-8 string, only its byte count", but at the
> same time it would be nice (for quick-and-dirty, ASCII-only programs) if
> it worked.
>
> Is it possible to declare them like this?
> int strlen(void* s);
>
> and for char[] to be implicitly 'paintable' as void*, just as char[] is
> already implicitly 'paintable' as void[]?
>
> It seems like it would nicely solve the problem of people seeing:
> int strlen(char* s);
>
> and thinking D's char is the same as C's char, without introducing a
> painful need for casts or conversions in simple, ASCII-only situations.
>
> Regan
Another option would be to change char.init to 0 and forget about the
problem, or leave it as it is now. A good string implementation will
contain an encoding field in the string instance if needed.
Andrew.