ubyte vs. char for non-UTF-8 (was Re: toString vs. toUtf8)

Julio César Carrascal Urquijo jcarrascal at gmail.com
Tue Nov 20 14:37:03 PST 2007


Matti Niemenmaa wrote:
> Assume you have an ubyte[] named iso_8859_1_string which contains a string
> encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it to
> work, you need to call "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" -
> note the annoying cast.

You can't assume that a function designed to work on UTF-8 strings 
works with ISO-8859-1 strings. Beyond the ASCII range, UTF-8 isn't 
byte-compatible with any other charset.


> The same thing applies the other way, of course - assume the C standard library
> accepts ubyte* instead of char* for all the C string functions. This is more
> correct than the current situation, as the C standard library is
> encoding-independent. Now, if you have a UTF-8 string which you wish to pass to
> a C string handling function, you need to do, for instance:
> "printf(cast(ubyte*)utf_8_string.ptr)" - another cast.

This is probably the actual problem: C string functions should accept 
ubyte* instead of char*, because a ubyte doesn't have an implied 
encoding while a char does.


> If encoding-independent functions accept only char, then it's the former case
> for _every_ call to a string function when you're dealing with non-UTF strings,
> which quickly becomes onerous.

Unless you are referring to a conversion library like ICU, I don't 
understand your point about "encoding-independent functions". Phobos' 
string functions aren't encoding-independent.


> I actually tried this, but the code ended up so unreadable that I was forced to
> change it back, thus having arbitrarily-encoded bytes stored in char[], just for
> the convenience of being able to use string functions on them.

If you've done that, I fear you'll see lots of exceptions appearing in 
your string-handling code once you deliver your program to any 
non-English-speaking user.


> Here're the details of the solution to this problem that I've thought of:
> 
> Make char, char*, char[], etc. all implicitly castable to the corresponding
> ubyte types, and equivalently for wchar/ushort and dchar/uint. Then, functions
> which require UTF-x can continue to use [dw]char while functions which work
> regardless of encoding (most functions in std.string) should use ubyte. This
> way, the functions transparently work for [dw]string whilst still working for
> non-UTF.

Most functions in std.string *require* UTF-8, or they'll blow up with a 
"Error: 4invalid UTF-8 sequence" message.

Actually, I think the implicit casting would be useful for string literals:

ubyte[] foo = "Julio César";	// In ISO-8859-1.

But then I'd need some way to tell the compiler that the string is in 
ISO-8859-1. What I don't see is where your proposal helps with the 
example you gave. For example, if I try to uppercase foo, I'd get an 
exception:

toupper(foo);	// BOOM!


> To be precise, in the above, "work regardless of encoding" should be read as
> "works on more than one encoding": even a simple function like std.string.strip
> would have to be changed to work on EBCDIC, for instance. I would assume ASCII,
> especially given that D doesn't target machines older than relatively modern
> 32-bit computers, to be the common subset. This way ubyte[] would mean "ASCII or
> something else" and it's up to the programmer to not call it on functions which
> require ASCII. I don't think this is a problem.

I think this is unrealistic unless you want std.string to become 
something more like ICU. There are just too many (popular) encodings and 
variations in use today... and you'll have to support most of them once 
you start promising to "work on more than one encoding".

Even Unicode has UCS-2, the not-quite-UTF-16 encoding used in Windows 
NT4 (yes, there are still lots of machines running NT4).



-- 
Julio César Carrascal Urquijo
http://jcesar.artelogico.com/

