ubyte vs. char for non-UTF-8 (was Re: toString vs. toUtf8)

Matti Niemenmaa see_signature at for.real.address
Wed Nov 21 02:21:25 PST 2007


Julio César Carrascal Urquijo wrote:
> Matti Niemenmaa wrote:
>> Assume you have a ubyte[] named iso_8859_1_string which contains a string
>>  encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it
>>  to work, you need to call 
>> "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying
>>  cast.
> 
> You can't assume that a function designed to work on UTF-8 strings works 
> with ISO-8859-1 strings. Beyond the ASCII range, UTF-8 isn't compatible with 
> any other charset.

I am well aware of this. I chose strip as an example because it does work on any
encoding: it simply calls std.ctype.isspace on each char.
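To make that concrete, here is a sketch (a hypothetical helper, not the actual Phobos source) of a byte-wise strip. It examines one char at a time and never decodes multi-byte sequences, so it behaves identically for any encoding whose whitespace bytes sit in the ASCII positions:

```d
import std.ctype;

// Hypothetical sketch of a byte-wise strip: classifies one char (byte)
// at a time via std.ctype.isspace, so it never decodes UTF-8 and works
// unchanged for any encoding with ASCII whitespace in the ASCII range.
char[] byteStrip(char[] s)
{
	size_t i = 0;
	while (i < s.length && isspace(s[i]))
		++i;
	size_t j = s.length;
	while (j > i && isspace(s[j - 1]))
		--j;
	return s[i .. j];
}
```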

> This is probably the actual problem: C string functions should accept ubyte* 
> instead of char* because a ubyte doesn't have an implied encoding while char 
> does.

Yes. But there are also many D string functions which would work on any encoding.

>> If encoding-independent functions accept only char, then it's the former 
>> case for _every_ call to a string function when you're dealing with non-UTF
>>  strings, which quickly becomes onerous.
> 
> Unless you are referring to a conversion library like ICU, I don't understand
>  your point on "encoding-independent functions". Phobos' string functions 
> aren't "encoding-independent".

Most are, actually, except for the fact that D character constants are always
ASCII. Almost all the std.string functions will work for any "extended ASCII"
encoding.

And that's what I mean. Given that D doesn't target the kind of machines that
use EBCDIC, I use "encoding-independent" to mean either "works on any encoding"
or "works on any encoding with ASCII as the lower 128 values".

>> I actually tried this, but the code ended up so unreadable that I was 
>> forced to change it back, thus having arbitrarily-encoded bytes stored in 
>> char[], just for the convenience of being able to use string functions on 
>> them.
> 
> If you've done that I fear you'll see lots of exceptions appearing in your 
> string handling code once you deliver your program to any non-English 
> speaking user.

Trust me, I know what I'm doing.

For instance, the integer conversion functions in std.conv only look for values
in the range '0' to '9', ignoring all others. If the encoding has the digits in
the same place as ASCII, it will work, regardless of what all the other bytes in
the encoding are.
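A minimal sketch of that kind of digit scan (similar in spirit to std.conv, not its actual source) shows why: only the byte values '0' through '9' matter, so any ASCII-superset encoding parses identically:

```d
// Sketch of an ASCII-range digit scan. Bytes outside '0'..'9' simply
// stop the scan, so what the rest of the encoding assigns to those
// byte values is irrelevant.
uint parseDigits(char[] s)
{
	uint n = 0;
	foreach (char c; s)
	{
		if (c < '0' || c > '9')
			break;
		n = n * 10 + (c - '0');
	}
	return n;
}
```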

If the encoding has the digits in a different place than ASCII, then it won't
work, true. But I think you'll find that using EBCDIC or another non-ASCII-based
encoding will confuse most of the programs you've got installed on your computer.

> Most functions in std.string *require* UTF-8 or they'll blow up with an 
> "Error: invalid UTF-8 sequence" message.

No, they do not. Some do, but not most. Of all the functions that take char[] or
char* in std.string:

Functions     requiring UTF-8: 22
Functions not requiring UTF-8: 35

> Actually, I think the implicit casting would be useful for string literals:
> 
> byte[] foo = "Julio César";    // In ISO-8859-1.
> 
> But then I need some way to tell the compiler that the string is in 
> ISO-8859-1. What I don't see is where does your proposal helps with the 
> example you were giving. For example, if I try to uppercase foo I would get 
> an exception:
> 
> toupper(foo);    // BOOM!

True, you would, because std.string.toupper assumes UTF-8. Hence, its type
should be string(string), which you couldn't call with byte[], since byte[]
doesn't implicitly convert to char[].

But consider what happens now with char[]. The following program compiles, but
blows up at runtime:

import std.string;
void main() {
	char[] foo = "Julio C\xe9sar";  // \xe9 is 'é' in ISO-8859-1; not valid UTF-8
	toupper(foo);                   // throws at runtime: invalid UTF-8 sequence
}

An amendment to my proposal to correct this would be that hex strings, and any
string which contains a byte sequence which is not valid UTF, would become
ubyte/ushort/uint. Thus the above would fail with a type error because the type
of the literal is ubyte[], and it cannot be assigned to a char[]. If the type of
foo were ubyte[], calling toupper would fail with a type error.

Thereby the only way to get the program above to compile, aside from changing
the string literal to UTF-8, would be with a cast, which shows that there's
something unsafe going on.
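To make the contrast concrete, here is a sketch of how the earlier program would look under the proposed semantics (this is the proposal, not what any current compiler does):

```d
import std.string;
void main() {
	// Under the amended proposal, a literal containing the invalid
	// UTF-8 byte \xe9 would have type ubyte[] on its own; the cast
	// here only mimics that proposed behaviour in today's D.
	ubyte[] foo = cast(ubyte[]) "Julio C\xe9sar";
	// toupper(foo);              // would be a compile-time type error
	toupper(cast(char[]) foo);    // the explicit cast flags the unsafe step
}
```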

> I think this is unrealistic unless you want to change std.string to be 
> something more like ICU. There are just too many (popular) encodings and 
> variations in use today... and you'll have to support most of them once you 
> start promising to "works on more than one encoding".

By "works on more than one encoding" I meant "works for anything with ASCII as
the lower 128 bytes". You'll find that covers the majority of encodings in
common use today.

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
