ubyte vs. char for non-UTF-8 (was Re: toString vs. toUtf8)

Matti Niemenmaa see_signature at for.real.address
Tue Nov 20 07:44:59 PST 2007


Walter Bright wrote:
> char[] => string
> wchar[] => wstring
> dchar[] => dstring
> 
> These are all unicode strings. Putting non-unicode encodings in them,
> even temporarily, should be discouraged. Non-unicode encodings should
> use ubyte[], ushort[], etc.

At last! This is the way I've thought it should be for a long time. To be
practically usable, however, it requires a language change - making
char/wchar/dchar implicitly convertible to ubyte/ushort/uint - and a set of
library changes - functions that don't require UTF should take
ubyte/ushort/uint instead. Details follow.

Assume you have a ubyte[] named iso_8859_1_string which contains a string
encoded in ISO-8859-1. To call std.string.strip on it and expect it to work,
you need to write "std.string.strip(cast(char[])iso_8859_1_string)" - note the
annoying cast.
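
For instance (a minimal compilable sketch; strip happens to work here because
it only examines the ends of the string, so the non-UTF-8 byte in the middle
goes unnoticed):

import std.string : strip;

void main()
{
    // ISO-8859-1 bytes for "  naïve  "; the 0xEF byte ('ï') is not
    // valid UTF-8, so this data does not really belong in a char[].
    ubyte[] iso_8859_1_string =
        [' ', ' ', 'n', 'a', 0xEF, 'v', 'e', ' ', ' '];

    // std.string.strip only accepts char[], forcing a reinterpreting cast:
    char[] stripped = strip(cast(char[])iso_8859_1_string);
    assert(stripped.length == 5); // the spaces are gone, the 0xEF intact
}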

The same thing applies the other way, of course. Assume the C standard library
bindings accepted ubyte* instead of char* for all the C string functions -
more correct than the current situation, since the C standard library is
encoding-agnostic. Then, to pass a UTF-8 string to a C string-handling
function, you would need to write, for instance,
"printf(cast(ubyte*)utf_8_string.ptr)" - another cast.

If encoding-independent functions accept only char, then the former cast is
needed for _every_ call to a string function when dealing with non-UTF
strings, which quickly becomes onerous.

I actually tried this - storing non-UTF data in ubyte[] and casting at every
call - but the code ended up so unreadable that I was forced to change it
back, storing arbitrarily-encoded bytes in char[] just for the convenience of
being able to use the string functions on them.

Here are the details of the solution to this problem that I've thought of:

Make char, char*, char[], etc. all implicitly convertible to the corresponding
ubyte types, and equivalently for wchar/ushort and dchar/uint. Then functions
which require UTF-x can continue to use [dw]char, while functions which work
regardless of encoding (most functions in std.string) take ubyte instead. This
way the functions transparently work on [dw]strings whilst still working on
non-UTF data.
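
A sketch of the split (stripBytes is a hypothetical name, and the char[]
forwarding overload merely stands in for the proposed implicit conversion,
which would make it unnecessary):

// Encoding-independent: assumes only the ASCII subset, so it works on
// UTF-8, the ISO-8859 family, and any other ASCII-compatible encoding.
ubyte[] stripBytes(ubyte[] s)
{
    size_t a = 0, b = s.length;
    while (a < b && asciiWhite(s[a]))     ++a;
    while (b > a && asciiWhite(s[b - 1])) --b;
    return s[a .. b];
}

bool asciiWhite(ubyte c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r'
        || c == '\v' || c == '\f';
}

// Under the proposal this overload disappears: char[] would convert to
// ubyte[] implicitly. Today, it (or a cast at every call site) is what
// makes the ubyte[] version usable on UTF-8 strings.
char[] stripBytes(char[] s)
{
    return cast(char[])stripBytes(cast(ubyte[])s);
}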

To be precise, "works regardless of encoding" above should be read as "works
on more than one encoding": even a simple function like std.string.strip would
have to be changed to handle EBCDIC, for instance. I would take ASCII to be
the common subset, especially given that D doesn't target machines older than
relatively modern 32-bit computers. This way ubyte[] would mean "ASCII or
something else", and it's up to the programmer not to pass data in a
non-ASCII-compatible encoding to functions which assume ASCII. I don't think
this is a problem.
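
To illustrate with the stripBytes sketch above (EBCDIC encodes space as 0x40
and 'a' as 0x81, per the standard code charts):

void main()
{
    // "  a  " in EBCDIC: an ASCII-assuming strip strips nothing, since
    // 0x40 != 0x20; this is exactly why EBCDIC data must not be handed
    // to functions that assume the ASCII subset.
    ubyte[] ebcdic = [0x40, 0x40, 0x81, 0x40, 0x40];
    assert(stripBytes(ebcdic) == ebcdic);

    // On any ASCII-compatible encoding it behaves as expected:
    ubyte[] latin1 = [' ', ' ', 'n', 'a', 0xEF, 'v', 'e', ' ', ' '];
    assert(stripBytes(latin1) == latin1[2 .. 7]);
}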

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi


