ubyte vs. char for non-UTF-8 (was Re: toString vs. toUtf8)

Regan Heath regan at netmail.co.nz
Tue Nov 20 09:03:08 PST 2007


Matti Niemenmaa wrote:
> Walter Bright wrote:
>> char[] => string
>> wchar[] => wstring
>> dchar[] => dstring
>>
>> These are all unicode strings. Putting non-unicode encodings in them,
>> even temporarily, should be discouraged. Non-unicode encodings should
>> use ubyte[], ushort[], etc.
> 
> At last! This is the way I've been thinking it should be for a long time.
> However, this requires a change to the language - make char/wchar/dchar types
> implicitly convertible to ubyte/ushort/uint - and a bunch of library changes -
> functions that don't require UTF should use ubyte/ushort/uint - in order to be
> practically usable. Details follow.
> 
> Assume you have an ubyte[] named iso_8859_1_string which contains a string
> encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to
> work, you need to call "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" -
> note the annoying cast.

I think we should be encouraging people to convert this data to UTF-8 
before calling any D string handling functions on it (those that accept 
w/d/char[]).  Which implies all D string handling functions should only 
operate on UTF-8/16/32.
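
A minimal sketch of that convert-first approach, assuming a current D2
compiler and Phobos (std.string.strip taking a string). The helper name
latin1ToUtf8 is made up for illustration and is not an existing library
function. It relies on the fact that each ISO-8859-1 byte has the same
numeric value as its Unicode code point:

```d
import std.string : strip;

// Hypothetical helper, not part of Phobos: transcode ISO-8859-1 to UTF-8.
// Each ISO-8859-1 byte value equals the Unicode code point it represents.
string latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    foreach (b; input)
    {
        // Appending a dchar to a char[] encodes it as UTF-8.
        result ~= cast(dchar) b;
    }
    return result.idup;
}

void main()
{
    // " café " encoded in ISO-8859-1 (0xE9 is 'é').
    ubyte[] iso_8859_1_string = [0x20, 0x63, 0x61, 0x66, 0xE9, 0x20];

    // Convert first, then use the ordinary UTF-8 string functions.
    string utf8 = latin1ToUtf8(iso_8859_1_string);
    assert(strip(utf8) == "caf\u00E9");
}
```

No cast to char[] is needed at the call site, and the data handed to
strip really is UTF-8.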

If they want to call a C function like those in std.c.<whatever> on it, 
it should just work as expected.  Which implies std.c.<whatever> 
functions should accept ubyte* or void* or something, not char*.

> The same thing applies the other way, of course - assume the C standard library
> accepts ubyte* instead of char* for all the C string functions. This is more
> correct than the current situation, as the C standard library is
> encoding-independent. Now, if you have a UTF-8 string which you wish to pass to
> a C string handling function, you need to do, for instance:
> "printf(cast(ubyte*)utf_8_string.ptr)" - another cast.

w/d/char[] arrays are implicitly convertible to void[] (and void*?) so 
perhaps C functions should accept void* instead?  I mean, void* means 
"pointer to something/anything"...
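
A rough sketch of how a void-based signature behaves, assuming a current
D2 compiler and the druntime binding core.stdc.string.memchr, which is
declared with void* parameters. The wrapper indexOfByte is a made-up
name for illustration. As far as I know, current D2 converts arrays
implicitly to void[] but not directly to void*, so the pointer is taken
via .ptr:

```d
import core.stdc.string : memchr;

// Hypothetical wrapper: offset of the first byte equal to `value`, or -1.
// const(void)[] accepts char[], string and ubyte[] without a cast.
ptrdiff_t indexOfByte(const(void)[] data, ubyte value)
{
    // memchr takes void*, so no encoding is implied by its signature.
    auto hit = memchr(data.ptr, value, data.length);
    if (hit is null)
        return -1;
    return cast(const(ubyte)*) hit - cast(const(ubyte)*) data.ptr;
}

void main()
{
    string utf_8_string = "hello";
    ubyte[] iso_8859_1_string = [0x63, 0x61, 0x66, 0xE9];

    // Both calls compile without a cast on the array argument.
    assert(indexOfByte(utf_8_string, cast(ubyte) 'l') == 2);
    assert(indexOfByte(iso_8859_1_string, 0xE9) == 3);
    assert(indexOfByte(iso_8859_1_string, 0x00) == -1);
}
```

The trade-off is that void* also accepts data that is not text at all,
so the type no longer says anything about the contents.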

Regan


