ubyte vs. char for non-UTF-8 (was Re: toString vs. toUtf8)

Robert DaSilva sp.unit.262+digitalmars at gmail.com
Tue Nov 20 15:10:08 PST 2007


Matti Niemenmaa wrote:
> Walter Bright wrote:
>> char[] => string
>> wchar[] => wstring
>> dchar[] => dstring
>>
>> These are all unicode strings. Putting non-unicode encodings in them,
>> even temporarily, should be discouraged. Non-unicode encodings should
>> use ubyte[], ushort[], etc.
> 
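If I read Walter right, usage would look something like this (my own
sketch, just to show the split):

    // Unicode text goes in the new aliases, guaranteed well-formed UTF:
    string s = "hello";

    // Non-Unicode bytes (say ISO-8859-1) go in ubyte[]:
    ubyte[] latin1 = cast(ubyte[])"hello".dup;
    latin1[1] = 0xE9;   // 'e' -> Latin-1 e-acute: fine in a ubyte[],
                        // but would break a char[]'s UTF-8 guarantee
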
> At last! This is the way I've been thinking it should be for a long time.
> However, this requires a change to the language - make char/wchar/dchar types
> implicitly convertible to ubyte/ushort/uint - and a bunch of library changes -
> functions that don't require UTF should use ubyte/ushort/uint - in order to be
> practically usable. Details follow.
> 
> Assume you have a ubyte[] named iso_8859_1_string which contains a string
> encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to
> work, you need to call "std.string.strip(cast(char[])iso_8859_1_string)" -
> note the annoying cast.
> 
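Yes - and in practice you end up writing little shims to hide the cast,
something like this (untested, the helper names are made up):

    import std.string;

    // Hypothetical helpers: reinterpret the static type only;
    // the bytes themselves are never touched.
    char[]  asChars(ubyte[] b) { return cast(char[])b; }
    ubyte[] asBytes(char[] s)  { return cast(ubyte[])s; }

    ubyte[] stripped = asBytes(strip(asChars(iso_8859_1_string)));
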
> The same thing applies the other way, of course - assume the C standard library
> accepts ubyte* instead of char* for all the C string functions. This is more
> correct than the current situation, as the C standard library is
> encoding-independent. Now, if you have a UTF-8 string which you wish to pass to
> a C string handling function, you need to do, for instance:
> "printf(cast(ubyte*)utf_8_string.ptr)" - another cast.
> 
> If encoding-independent functions accept only char, then it's the former case
> for _every_ call to a string function when you're dealing with non-UTF strings,
> which quickly becomes onerous.
> 
> I actually tried this, but the code ended up so unreadable that I was forced to
> change it back, thus having arbitrarily-encoded bytes stored in char[], just for
> the convenience of being able to use string functions on them.
> 
> Here are the details of the solution to this problem that I've thought of:
> 
> Make char, char*, char[], etc. all implicitly castable to the corresponding
> ubyte types, and equivalently for wchar/ushort and dchar/uint. Then, functions
> which require UTF-x can continue to use [dw]char while functions which work
> regardless of encoding (most functions in std.string) should use ubyte. This
> way, the functions transparently work for [dw]string whilst still working for
> non-UTF.
> 
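So under your proposal the library would split roughly like this, if I
follow (a sketch - toUTFindex is the real std.utf function, but the
ubyte[] strip is the hypothetical changed version):

    // Requires well-formed UTF-8, so it keeps taking char[]:
    size_t toUTFindex(char[] s, size_t n);

    // Encoding-agnostic, so it takes ubyte[]:
    ubyte[] strip(ubyte[] s);

    char[] s = "  hello  ";
    ubyte[] t = strip(s);   // relies on the proposed implicit
                            // char[] -> ubyte[] conversion
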
> To be precise, in the above, "work regardless of encoding" should be read as
> "works on more than one encoding": even a simple function like std.string.strip
> would have to be changed to work on EBCDIC, for instance. Given that D doesn't
> target machines older than relatively modern 32-bit computers, I would assume
> ASCII to be the common subset. This way ubyte[] would mean "ASCII or something
> else", and it's up to the programmer not to hand non-ASCII-compatible data to
> functions which require ASCII. I don't think this is a problem.
> 
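To make the EBCDIC point concrete: an ASCII-assuming strip keys on byte
values that simply differ there - space is 0x40 in EBCDIC, not 0x20 - so
a test like this is right for ASCII, ISO-8859-x and UTF-8, but wrong for
EBCDIC:

    // Whitespace test that assumes an ASCII-compatible encoding.
    bool isAsciiSpace(ubyte b)
    {
        // 0x09..0x0D are tab, LF, VT, FF, CR; 0x20 is space.
        return b == 0x20 || (b >= 0x09 && b <= 0x0D);
    }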

Perhaps {,w,d}char should become typedefs of u{byte,short,int} and be
dropped as keywords?
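
Something like this, say (not compilable as long as they're still
keywords, but it shows the idea):

    // A typedef makes a distinct type that implicitly converts *to*
    // its base type but not back - exactly the one-way char -> ubyte
    // conversion asked for above.
    typedef ubyte  char;
    typedef ushort wchar;
    typedef uint   dchar;

That way encoding-agnostic functions declared with ubyte[] would accept
char[] for free, while char[] itself stays a distinct, UTF-only type.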


