ubyte vs. char for non-UTF-8 (was Re: toString vs. toUtf8)

Regan Heath regan at netmail.co.nz
Wed Nov 21 02:34:29 PST 2007


Matti Niemenmaa wrote:
> Julio César Carrascal Urquijo wrote:
>> Matti Niemenmaa wrote:
>>> Assume you have an ubyte[] named iso_8859_1_string which contains a string
>>>  encoded in ISO-8859-1. Now, to call std.string.strip on this and expect it
>>>  to work, you need to call 
>>> "std.string.strip(*cast(char[]*)iso_8859_1_string.ptr)" - note the annoying
>>>  cast.
>> You can't assume that a function designed to work on UTF-8 strings works 
>> with ISO-8859-1 strings. Beyond the ASCII range UTF-8 isn't compatible with 
>> any other charset.
> 
> I am well aware of this. I chose strip as an example because it does work on any
> encoding: it simply calls std.ctype.isspace on each char.

But, this behaviour isn't guaranteed.  In fact I would expect that in 
future a library like iconv will be leveraged to determine if a 
character 'is a space' and it will assume the input data is UTF-8.

So, if your ASCII based encoding has characters outside the ASCII range 
and they just happen to match a valid 'is a space' character from the 
UTF-8 set, then .. whoops.

Now, I don't have a canonical knowledge of character sets so it may be 
that there are no space characters outside the ASCII range defined in 
UTF-8... (perhaps when you include surrogate pairs?) or, even if they 
exist the chance of an ASCII based character set using that value may be 
pretty small.
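
For what it's worth, at least one such case can be constructed. Unicode treats U+00A0 (NO-BREAK SPACE) as whitespace, and UTF-8 encodes it as the two bytes C2 A0. In ISO-8859-1 those same two bytes are "Â" followed by a no-break space. The C sketch below uses a hypothetical function, utf8_rstrip_len, to stand in for a future UTF-8-aware strip; it is not any real library routine, and it only handles ASCII space and U+00A0.

```c
#include <stddef.h>

/* Hypothetical UTF-8-aware right-strip (illustration only).
   Drops trailing ASCII spaces and trailing U+00A0, which UTF-8
   encodes as the byte pair C2 A0. Returns the stripped length;
   the buffer itself is not modified. */
size_t utf8_rstrip_len(const unsigned char *s, size_t len)
{
    for (;;) {
        if (len >= 1 && s[len - 1] == ' ') {
            len -= 1;                       /* one-byte ASCII space */
        } else if (len >= 2 && s[len - 2] == 0xC2 && s[len - 1] == 0xA0) {
            len -= 2;                       /* two-byte U+00A0 */
        } else {
            return len;
        }
    }
}
```

Given the three bytes 41 C2 A0, this returns 1. Read as UTF-8 that is correct ("A" plus a no-break space), but read as ISO-8859-1 the input was "AÂ" plus a no-break space, so the "Â" has been silently eaten along with the space.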

Who knows, all I'm saying is that if a function says it accepts char[] 
then it is saying "I accept valid UTF-8" and not "I accept any ASCII 
based character data" so all bets are off if you pass it anything other 
than UTF-8.

>> This is probably the actual problem: C string functions should accept ubyte* 
>> instead of char* because a ubyte doesn't have an implied encoding while char 
>> does.
> 
> Yes. But there are also many D string functions which would work on any encoding.

At present. But that's not guaranteed and it may change in the future, 
in fact, I expect it to.

As far as I can see the only guaranteed thing is that the C functions 
will not change and will continue to accept ASCII based character sets 
without possible future gotchas.

So, if you must perform string manipulation on non-UTF-8 data then you 
should either write your own functions, or use the C ones.
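
As a rough idea of the "write your own" option, here is a minimal sketch in C of a strip that works on raw bytes and only ever treats the six ASCII whitespace characters as space. The names is_ascii_space and bytes_strip are made up for this example. It deliberately avoids isspace() from <ctype.h>, on the assumption that you don't want the result to depend on the current locale.

```c
#include <stddef.h>

/* True only for the six ASCII whitespace bytes: space, \t, \n, \v, \f, \r.
   Never consults the locale. */
static int is_ascii_space(unsigned char c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Hypothetical helper: finds the stripped sub-range of buf[0..len).
   Stores the start offset in *start and returns the stripped length.
   Bytes >= 0x80 are never treated as whitespace, so data in any
   ASCII-based single-byte encoding keeps all of its non-ASCII bytes. */
size_t bytes_strip(const unsigned char *buf, size_t len, size_t *start)
{
    size_t b = 0, e = len;
    while (b < e && is_ascii_space(buf[b])) b++;        /* skip leading */
    while (e > b && is_ascii_space(buf[e - 1])) e--;    /* skip trailing */
    *start = b;
    return e - b;
}
```

Because the buffer is unsigned char and no byte above 0x7F is ever classified, the function makes no claim about the encoding beyond "ASCII-compatible", which is all an ISO-8859-1 string can promise.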

Regan


