ubyte vs. char for non-UTF-8 (was Re: toString vs. toUtf8)
Regan Heath
regan at netmail.co.nz
Tue Nov 20 09:03:08 PST 2007
Matti Niemenmaa wrote:
> Walter Bright wrote:
>> char[] => string
>> wchar[] => wstring
>> dchar[] => dstring
>>
>> These are all unicode strings. Putting non-unicode encodings in them,
>> even temporarily, should be discouraged. Non-unicode encodings should
>> use ubyte[], ushort[], etc.
>
> At last! This is the way I've been thinking it should be for a long time.
> However, this requires a change to the language - make char/wchar/dchar types
> implicitly convertible to ubyte/ushort/uint - and a bunch of library changes -
> functions that don't require UTF should use ubyte/ushort/uint - in order to be
> practically usable. Details follow.
>
> Assume you have an ubyte[] named iso_8859_1_string which contains a string
> encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to
> work, you need to call "std.string.strip(cast(char[])iso_8859_1_string)" -
> note the annoying cast.
I think we should be encouraging people to convert this data to UTF-8
before calling any D string handling functions on it (those that accept
w/d/char[]). Which implies all D string handling functions should only
operate on UTF-8/16/32.
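
A rough sketch of what "convert first" could look like for the ISO-8859-1 case, written against current D2 syntax. latin1ToUtf8 is a hypothetical helper made up for illustration, not a Phobos function. It relies only on the fact that every ISO-8859-1 byte value equals its Unicode code point, so bytes of 0x80 and above become two-byte UTF-8 sequences:

```d
import std.string : strip;

// Hypothetical helper (not part of Phobos): transcode ISO-8859-1 to UTF-8.
string latin1ToUtf8(const(ubyte)[] src)
{
    char[] result;
    foreach (b; src)
    {
        if (b < 0x80)
        {
            // ASCII range: identical in both encodings.
            result ~= cast(char) b;
        }
        else
        {
            // Code points U+0080..U+00FF need two UTF-8 code units.
            result ~= cast(char)(0xC0 | (b >> 6));
            result ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result.idup;
}

void main()
{
    // " café " encoded in ISO-8859-1 (0xE9 is 'é').
    ubyte[] iso_8859_1_string = [0x20, 0x63, 0x61, 0x66, 0xE9, 0x20];

    // Once the data is UTF-8, the string functions apply with no cast.
    string s = strip(latin1ToUtf8(iso_8859_1_string));
    assert(s == "café");
}
```

The non-Unicode data stays in ubyte[] until it has been transcoded, and only then enters the char[]/string world.
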
If they want to call a C function like those in std.c.<whatever> on it,
it should just work as expected. Which implies std.c.<whatever>
functions should accept ubyte* or void* or something, not char*.
> The same thing applies the other way, of course - assume the C standard library
> accepts ubyte* instead of char* for all the C string functions. This is more
> correct than the current situation, as the C standard library is
> encoding-independent. Now, if you have a UTF-8 string which you wish to pass to
> a C string handling function, you need to do, for instance:
> "printf(cast(ubyte*)utf_8_string.ptr)" - another cast.
w/d/char[] arrays are implicitly convertible to void[] (and void*?) so
perhaps C functions should accept void* instead? I mean, void* means
"pointer to something/anything"...
Regan