ubyte vs. char for non-UTF-8 (was Re: toString vs. toUtf8)

Wed Nov 21 02:46:10 PST 2007

Regan Heath wrote:
> But, this behaviour isn't guaranteed.  In fact I would expect that in
> future a library like iconv will be leveraged to determine if a
> character 'is a space' and it will assume the input data is UTF-8.

You're right. See below.

> So, if your ASCII based encoding has characters outside the ASCII range
> and they just happen to match a valid 'is a space' character from the
> UTF-8 set, then .. whoops.
> 
> Now, I don't have a canonical knowledge of character sets so it may be
> that there are no space characters outside the ASCII range defined in
> UTF-8... (perhaps when you include surrogate pairs?) or, even if they
> exist the chance of an ASCII based character set using that value may be
> pretty small.

std.string.LS and std.string.PS are two examples of Unicode whitespace
characters. std.string.strip, for some reason, does not strip them.
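
To make that concrete, here is a rough sketch in present-day D. The names are
assumptions on my part: std.uni.lineSep and std.uni.paraSep are the current
Phobos names for U+2028 and U+2029 (not the 2007 std.string.LS/PS), and
asciiStrip is a hypothetical helper that only knows about ASCII whitespace. It
is not the actual std.string.strip, just an illustration of what an ASCII-only
strip leaves behind.

```d
import std.ascii : isWhite;                              // ASCII-only whitespace test
import std.uni : lineSep, paraSep, uniIsWhite = isWhite; // Unicode-aware test

// Hypothetical helper: trims only ASCII whitespace from both ends.
string asciiStrip(string s)
{
    size_t i = 0;
    size_t j = s.length;
    while (i < j && isWhite(s[i])) ++i;
    while (j > i && isWhite(s[j - 1])) --j;
    return s[i .. j];
}

void main()
{
    // Both separators count as whitespace under the Unicode-aware test.
    assert(uniIsWhite(lineSep));
    assert(uniIsWhite(paraSep));

    // U+2028 is three bytes in UTF-8 (E2 80 A8), none of which is ASCII
    // whitespace, so the ASCII-only strip stops in front of it.
    string t = asciiStrip(" text\u2028 ");
    assert(t == "text\u2028");
    assert(t.length == 7);
}
```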

> Who knows, all I'm saying is that if a function says it accepts char[]
> then it is saying "I accept valid UTF-8" and not "I accept any ASCII
> based character data" so all bets are off if you pass it anything other
> than UTF-8.

You are correct, which is exactly my point: char[] should mean UTF-8 whereas
currently many functions use it to mean "text with single-byte characters".

That std.string.strip uses char[] currently says nothing about whether it
expects UTF-8 or not. Were the std.c package converted to use ubyte[]
everywhere, there would be a clear distinction between UTF-8 and "anything".
Then, as you say, one should interpret std.string.* as accepting only UTF-8.

> As far as I can see the only guaranteed thing is that the C functions
> will not change and will continue to accept ASCII based character sets
> without possible future gotchas.
> 
> So, if you must perform string manipulation on non UTF data then you
> should either write your own functions, or use the C ones.

Correct.

The point is that storing non-UTF data in ubyte/ushort/uint is a difficult task
because even the C functions take char (or wchar_t, which I think is wchar on
Windows and dchar elsewhere), and thus the code quickly becomes castville. Cast
here, cast there, everywhere a cast cast - and for no good reason.
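
A minimal sketch of the kind of cast I mean, again in present-day D, so the
module name core.stdc.string is an assumption relative to the 2007 std.c
package. byteStringLength is a hypothetical helper and the Latin-1 sample data
is made up for illustration.

```d
import core.stdc.string : strlen;

// Hypothetical helper: length of a NUL-terminated byte string in some
// non-UTF-8 encoding, stored as ubyte[].
size_t byteStringLength(const(ubyte)[] data)
{
    // strlen is declared with a const(char)* parameter, so the ubyte
    // pointer has to be cast explicitly before the call.
    return strlen(cast(const(char)*) data.ptr);
}

void main()
{
    // "café" in ISO-8859-1, NUL-terminated. The byte 0xE9 on its own is
    // not valid UTF-8, so this data does not belong in a char[].
    ubyte[] latin1 = [0x63, 0x61, 0x66, 0xE9, 0x00];
    assert(byteStringLength(latin1) == 4);
}
```

Every call into the C library on such data needs a cast of this kind as long
as the bindings are declared in terms of char.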

Thus I believe, as per my original proposal, that library functions should be
converted to use ubyte[] where they are not meant to accept char[]. This may or
may not mean changes in std.string - it's up to the Phobos maintainers to
decide whether a function will ever require UTF-8, and whether to type it as
taking char[] or ubyte[]. In any case, at least the C functions should take
ubyte[].

The implicit casting from char-whatever to ubyte-whatever is useful when you
want to call C functions with D strings. Once again the code would rapidly
become castville if it had to be done explicitly.

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi