ubyte vs. char for non-UTF-8 (was Re: toString vs. toUtf8)

Tue Nov 20 09:12:01 PST 2007

Regan Heath wrote:
> I think we should be encouraging people to convert this data to UTF-8
> before calling any D string handling functions on it (those that accept
> w/d/char[]).  Which implies all D string handling functions should only
> operate on UTF-8/16/32.

This is impossible in the general case. Given a plain-text file, you cannot
know which encoding it is in. If you assume an encoding, convert the data to
UTF-8 for internal use, and then recode it back to that encoding for output,
you may lose information.
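
To make the loss concrete, here is a minimal sketch. It assumes present-day D
with Phobos (std.utf), not the libraries this thread was written against, and
roundTripAsUtf8 is a hypothetical helper named only for illustration. The input
is "café" stored as ISO-8859-1, but nothing in the bytes says so; guessing
UTF-8 and decoding leniently replaces the stray byte with U+FFFD, so the
original 0xE9 cannot be recovered on output.

```d
import std.algorithm.searching : canFind;
import std.array : array;
import std.exception : assertThrown;
import std.utf : byChar, byDchar, validate, UTFException;

// Hypothetical helper: decode the bytes as if they were UTF-8 (invalid
// sequences become U+FFFD), then encode the result back to UTF-8.
immutable(ubyte)[] roundTripAsUtf8(immutable(ubyte)[] raw)
{
    auto decoded = (cast(string) raw).byDchar.array; // dchar[]
    auto recoded = decoded.byChar.array;             // char[]
    return cast(immutable(ubyte)[]) recoded.idup;
}

void main()
{
    // "café" as ISO-8859-1 bytes; the encoding is not recorded anywhere.
    immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9];

    // The guess "this is UTF-8" is wrong: a lone 0xE9 is not valid UTF-8.
    assertThrown!UTFException(validate(cast(string) raw));

    // Lenient decoding followed by re-encoding does not give the input back.
    auto output = roundTripAsUtf8(raw);
    assert(output != raw);
    assert((cast(string) output).canFind('\uFFFD'));

    // Pure ASCII input survives, which is why the problem is easy to miss.
    immutable(ubyte)[] ascii = [0x63, 0x61, 0x66];
    assert(roundTripAsUtf8(ascii) == ascii);
}
```

Keeping the data as ubyte[] until the encoding is actually known avoids this,
because no decoding step gets a chance to rewrite the bytes.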

> w/d/char[] arrays are implicitly convertible to void[] (and void*?) so
> perhaps C functions should accept void* instead?  I mean, void* means
> "pointer to something/anything"...

void* means "pointer to anything", as you say. ubyte* means "pointer to unsigned
byte(s)", which is a different thing entirely.

To me, ubyte[] means either integers in the range 0-255 or "arbitrary data".
void[] is more like "arbitrary memory": used for hacking around language
restrictions or for extremely low-level stuff such as memory management.
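
The difference shows up in what the compiler lets you do with each type. The
following is only a sketch, again assuming present-day D; sizeInBytes is a
hypothetical function used for illustration. A ubyte[] has elements you can
read and compute with, while a void[] accepts any array but exposes nothing
except a length in bytes.

```d
// Hypothetical function: takes any array, sees only a span of memory.
size_t sizeInBytes(const(void)[] memory)
{
    return memory.length; // for void[], length is counted in bytes
}

void main()
{
    // ubyte[]: each element is an integer in the range 0-255.
    ubyte[] data = [0x63, 0x61, 0x66, 0xE9];
    assert(data[3] == 0xE9);
    assert(data[0] + 1 == 0x64);

    // Any array converts implicitly to void[] ...
    int[] numbers = [1, 2];
    void[] memory = numbers;
    assert(sizeInBytes(numbers) == 2 * int.sizeof);

    // ... but a void[] has no element values to compute with,
    static assert(!__traits(compiles, memory[0] + 1));

    // and an int[] does not silently become a ubyte[].
    static assert(!__traits(compiles, { ubyte[] b = numbers; }));

    // Getting bytes back out of "arbitrary memory" takes an explicit cast.
    auto bytes = cast(ubyte[]) memory;
    assert(bytes.length == 2 * int.sizeof);
}
```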

Would you consider malloc to return the same type of data that mbstrlen
accepts?

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi