ubyte vs. char for non-UTF-8 (was Re: toString vs. toUtf8)

Tue Nov 20 10:02:11 PST 2007

Regan Heath wrote:
> Matti Niemenmaa wrote:
>> Regan Heath wrote:
>>> I think we should be encouraging people to convert this data to UTF-8 
>>> before calling any D string handling functions on it (those that accept 
>>> w/d/char[]).  Which implies all D string handling functions should only 
>>> operate on UTF-8/16/32.
>> 
>> This is an impossible task. Given a plaintext file, you cannot know what 
>> encoding it is in. If you assume an encoding and convert it to UTF-8 for 
>> internal use and then recode it back to that encoding for output, you may 
>> lose information.
> 
> Yep, but the same thing may occur calling a D string function as it expects 
> UTF-8 and may even convert to dchar[] internally (which would probably throw
> an invalid UTF exception).

Which is why I think that unless you know it's UTF-8, you should use ubyte[].
Functions which expect UTF-8 would require char[], so passing them a ubyte[]
would be a type error.
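
A minimal sketch of that type error, assuming a current D2 compiler (where text
parameters are usually written const(char)[] rather than the char[] used in this
thread). countCodePoints is a hypothetical function for illustration, not part
of Phobos or Tango:

```d
import std.stdio : writeln;
import std.utf : validate;

// Hypothetical function that only makes sense for UTF-8 text.
size_t countCodePoints(const(char)[] text)
{
    size_t n = 0;
    foreach (dchar c; text) // decodes UTF-8; throws UTFException on invalid input
        ++n;
    return n;
}

void main()
{
    // Bytes of unknown encoding (here "café" in Latin-1, where é is 0xE9).
    ubyte[] raw = [0x63, 0x61, 0x66, 0xE9];

    // The type system rejects the call: ubyte[] does not convert to char[].
    static assert(!__traits(compiles, countCodePoints(raw)));

    // Only once the data is known to be UTF-8 do we cast, and validate.
    ubyte[] known = [0x63, 0x61, 0x66, 0xC3, 0xA9]; // "café" in UTF-8
    auto text = cast(const(char)[]) known;
    validate(text); // throws UTFException if the bytes are not valid UTF-8
    writeln(countCodePoints(text)); // 4
}
```

The cast then marks the one place where the program claims "this is UTF-8",
instead of that assumption being spread silently through every call.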

> You'd be better off passing this data to the C function that does what you 
> want.

A C function that does what you want isn't always available. Both Phobos's and
Tango's string processing capabilities are greater than the C standard
library's, even for plain ASCII.

The point is to make it easy to use non-UTF strings when necessary, without
having to resort to huge amounts of casts or writing your own functions with the
correct type signatures.
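
One way this could look, as a sketch only and assuming a current D2 compiler: a
library function templated on the element type, so that the same code accepts
both char[] and ubyte[] without casts. indexOfAscii is a hypothetical name, not
an existing Phobos or Tango function:

```d
import std.stdio : writeln;

// Hypothetical helper: index of the first occurrence of an ASCII delimiter,
// or -1 if it is absent. Templated so it accepts char and ubyte arrays alike.
ptrdiff_t indexOfAscii(T)(const(T)[] data, char delim)
    if (is(T == char) || is(T == ubyte))
{
    foreach (i, unit; data) // walks bytes / code units, no UTF decoding
        if (unit == delim)
            return i;
    return -1;
}

void main()
{
    // Latin-1 bytes for "né,e": not valid UTF-8, but still searchable.
    ubyte[] raw = [0x6E, 0xE9, 0x2C, 0x65];

    writeln(indexOfAscii(raw, ','));     // 2
    writeln(indexOfAscii("a,b,c", ',')); // 1
    writeln(indexOfAscii(raw, ';'));     // -1
}
```

This only works for operations that are meaningful byte by byte (which covers
ASCII delimiters in most 8-bit encodings); anything that needs real characters
still has to know the encoding.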

> What about other text encodings which do not have 8 bit sized 'character' 
> pieces, like UCS-2 (though not UCS-2 itself, because it is a subset of UTF-16
> and we can handle it as such)?  I'm not sure any exist, so this point may be
> invalid, but if one did exist then ubyte[] would not be the correct way to
> store it, perhaps ushort[] would.

Walter mentioned ushort[] in his post, as did I in mine.

> Or.. we could use void[]/void* for all types of unknown data and be done with
> it.  Using void* basically says "we don't know the type/format of the data 
> but we assume the function receiving the data does".

I just think "void" means "typeless" or "I don't know the type". "ubyte" means
something like "byte-oriented data" or "I don't care about the type". It all
depends on your point of view, but I think it's nice to have a semantic
difference between void and ubyte.

The meaning of plain byte, on the other hand, eludes me, beyond just "integer
from -128 to 127".

Another problem with using void[] to store data is that the garbage collector
assumes it may contain pointers, and thus scans it for references to live
memory. If some of the bytes happen to look like a valid pointer (small, but
nonzero, probability), the collector will not free the memory they appear to
point at, retaining it as long as the data lives, which could be as long as the
program runs.
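
This difference can be inspected at run time. The following is a sketch that
assumes a current D2 compiler with the reference druntime collector, whose
core.memory module exposes block attributes; isNoScan is a hypothetical helper
name. On that runtime, the ubyte[] block is expected to be flagged NO_SCAN and
the void[] block is not, but other collectors may behave differently:

```d
import core.memory : GC;
import std.stdio : writeln;

// True if the GC block containing `p` is flagged as pointer-free,
// meaning the collector will not scan its contents.
bool isNoScan(void* p)
{
    auto base = GC.addrOf(p); // start of the containing GC block
    return (GC.getAttr(base) & GC.BlkAttr.NO_SCAN) != 0;
}

void main()
{
    auto bytes = new ubyte[64]; // element type cannot hold pointers
    auto blob  = new void[64];  // contents unknown, may hold pointers

    writeln("ubyte[] NO_SCAN: ", isNoScan(bytes.ptr));
    writeln("void[]  NO_SCAN: ", isNoScan(blob.ptr));
}
```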

Hell, we /could/ use void[] to replace char[], byte[], and ubyte[], and why not
the rest of the types, too. But this isn't asm. This is D!

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi