ubyte vs. char for non-UTF-8 (was Re: toString vs. toUtf8)

Regan Heath regan at netmail.co.nz
Tue Nov 20 09:50:30 PST 2007


Matti Niemenmaa wrote:
> Regan Heath wrote:
>> I think we should be encouraging people to convert this data to UTF-8
>> before calling any D string handling functions on it (those that accept
>> w/d/char[]).  Which implies all D string handling functions should only
>> operate on UTF-8/16/32.
> 
> This is an impossible task. Given a plaintext file, you cannot know what
> encoding it is in. If you assume an encoding and convert it to UTF-8 for
> internal use and then recode it back to that encoding for output, you may lose
> information.

Yep, but the same thing may happen when you call a D string function, 
as it expects UTF-8 and may even convert to dchar[] internally (which 
would probably throw an invalid UTF exception).

Worse, it might work in one version of the library and fail in another 
due to internal changes of that sort.  In other words, the function 
cannot guarantee that it will operate correctly on your 'could be any 
encoding' data.
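
As a rough sketch of why such data can trip up a function that decodes 
UTF-8, here is a simplified well-formedness check in C.  The name 
is_valid_utf8 is made up for illustration; it is not a D or C library 
function, and a real decoder may be stricter or report errors 
differently.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if buf[0..len) is well-formed UTF-8.
   Rejects invalid lead bytes, missing or malformed continuation bytes,
   overlong forms, surrogates, and code points above U+10FFFF. */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;       /* number of continuation bytes expected */
        unsigned long cp;   /* code point being assembled */
        unsigned long min;  /* smallest code point allowed for this length */

        if (b < 0x80) { i++; continue; }  /* plain ASCII */
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;  /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < extra) return false;  /* sequence is truncated */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) return false;  /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;  /* overlong, out of range, or surrogate */
        i += extra + 1;
    }
    return true;
}
```

With this, the ISO-8859-1 bytes for "café" (63 61 66 E9) are rejected, 
because 0xE9 looks like the start of a three-byte sequence that never 
arrives, while the UTF-8 bytes (63 61 66 C3 A9) are accepted.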

You'd be better off passing this data to the C function that does what 
you want.

Convert input early and output late, I reckon.
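
A minimal sketch of the "convert early" half in C, assuming you do know 
the input is ISO-8859-1 (which, as Matti points out, you often can't 
know).  latin1_to_utf8 is a made-up helper name, not a library 
function.

```c
#include <stddef.h>

/* Hypothetical helper: converts len bytes of ISO-8859-1 to UTF-8.
   The caller must provide an output buffer of at least 2 * len bytes.
   Returns the number of bytes written (no terminator is added). */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = in[i];
        if (b < 0x80) {
            out[n++] = b;  /* ASCII maps to itself */
        } else {
            /* U+0080..U+00FF become a two-byte sequence */
            out[n++] = (unsigned char)(0xC0 | (b >> 6));
            out[n++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return n;
}
```

For example, the single byte 0xE9 (é) becomes the pair 0xC3 0xA9.  
This direction is lossless; the trouble described above only starts 
when the assumed input encoding is wrong.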

>> w/d/char[] arrays are implicitly convertible to void[] (and void*?) so
>> perhaps C functions should accept void* instead?  I mean, void* means
>> "pointer to something/anything"...
> 
> void* means "pointer to anything", as you say. ubyte* means "pointer to unsigned
> byte(s)", which is a different thing entirely.
>
> To me, ubyte[] means either integers in the range 0-255 or "arbitrary data".
> void[] is more like "arbitrary memory": used for hacking around language
> restrictions or for extremely low-level stuff such as memory management.
>
> Would you consider malloc as returning the same type of data which mbstrlen accepts?

Not the same type of data, but they could give/accept the same pointer.

void *p = malloc(100);
strcpy((char*)p, "test");
printf("%d", mbstrlen(p));

Memory is memory; the only difference between char* and void* is that 
char* knows (thinks) it's pointing at a char.
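
Here is a compilable variant of the snippet above.  mbstrlen is not a 
standard C function, so I've used a made-up byte_length that takes a 
void* and simply defers to strlen; treat it as a stand-in only.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for mbstrlen: accepts "pointer to anything"
   and assumes the caller handed it a NUL-terminated byte string. */
size_t byte_length(const void *p)
{
    return strlen((const char *)p);
}

/* Allocates raw memory, writes text into it through a char*, then
   passes the very same pointer back as a void*. */
size_t demo(void)
{
    void *p = malloc(100);
    if (p == NULL)
        return 0;               /* allocation failed */
    strcpy((char *)p, "test");  /* view the block as chars to fill it */
    size_t n = byte_length(p);  /* same address, no cast needed for void* */
    free(p);
    return n;
}
```

The pointer value never changes; only the type the compiler attaches 
to it does.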

What about other text encodings whose 'character' pieces are not 8 bits 
in size?  Something like UCS-2, though UCS-2 itself doesn't count 
because it is a subset of UTF-16 and we can handle it as such.  I'm not 
sure any exist, so this point may be invalid, but if one did exist then 
ubyte[] would not be the correct way to store it; perhaps ushort[] would.
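
To illustrate the element-size point in C, assume a hypothetical 
encoding made of 16-bit units and terminated by a zero unit.  
u16_length is a made-up helper, not a library function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts 16-bit units up to (not including) the
   terminating zero unit. */
size_t u16_length(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

Walking the same memory one byte at a time would not work, because 
units such as 0x0068 contain a zero byte that a byte-oriented function 
would mistake for the terminator.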

Or... we could use void[]/void* for all types of unknown data and be 
done with it.  Using void* basically says "we don't know the type/format 
of the data but we assume the function receiving the data does".

Regan


