C strings - byte, ubyte or char? (Discussion from Bugzilla)

Stewart Gordon smjg_1998 at yahoo.com
Thu Oct 4 06:40:50 PDT 2007


This bug report
http://d.puremagic.com/issues/show_bug.cgi?id=1357
"Cannot use FFFF and FFFE in Unicode escape sequences."

has drifted into a discussion about which type should be used for D bindings 
of C string-processing functions.  I hereby propose that we continue the 
discussion here.

Here I'll summarise what's been discussed so far, and add a few points.

First it was proposed that the C string type should become ubyte*, rather 
than char*, in D.  (Actually, whether it should be byte or ubyte depends on 
whether the underlying C implementation treats unqualified char as signed or 
unsigned.  But it probably doesn't matter to the implementations of most 
string functions.)
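
To make the proposal concrete, here's a rough sketch - an illustration, 
not the actual std.c.string module, which today declares strlen with char*:

    // Under the proposal, the binding would read:
    extern (C) size_t strlen(ubyte* s);

    void main()
    {
        // Raw bytes with a terminating NUL - no claim that they're UTF-8.
        ubyte[] raw = [0x68, 0x69, 0x00];   // "hi"
        assert(strlen(raw.ptr) == 2);
    }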

C APIs use the char type and its derivatives for two main purposes:
- code units in an arbitrary 8-bit character encoding
- byte-sized integer values of arbitrary semantics

Both of these uses distinguish it from D's char type, which is intended 
specifically for holding UTF-8 code units.  For both, byte or ubyte is more 
appropriate, so it would make sense to standardise on these types in D for 
communicating such data with C APIs.
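
For instance (a made-up illustration): a Latin-1 string is perfectly good 
data when viewed as arbitrary 8-bit code units, but putting it in a char[] 
misrepresents it, because it isn't valid UTF-8:

    void main()
    {
        // "café" in Latin-1: 'é' is the single byte 0xE9.
        ubyte[] latin1 = [0x63, 0x61, 0x66, 0xE9];   // honest: just bytes

        // The same bytes stored as char[] would claim to be UTF-8, but
        // 0xE9 is malformed there - in UTF-8 that byte could only start
        // a three-byte sequence.
    }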

This would entail:
- changing the string functions in std.c.* to use ubyte where they 
currently use char
- changing std.string.toStringz to return ubyte* instead of char*, and 
adding a corresponding char[] toString(ubyte*) function (both sketched 
below)
- changing the language to allow string literals to convert implicitly to 
ubyte*, in addition to the conversions they already have.
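
Purely as a sketch of the proposed signatures - this is hypothetical, not 
actual Phobos code:

    // Copy s into a new NUL-terminated buffer of C "chars".
    ubyte* toStringz(char[] s)
    {
        auto copy = new ubyte[s.length + 1];
        foreach (i, c; s)
            copy[i] = cast(ubyte) c;
        copy[s.length] = 0;     // explicit terminating NUL
        return copy.ptr;
    }

    // The converse: wrap a NUL-terminated C string as a D string.
    char[] toString(ubyte* s)
    {
        size_t n = 0;
        while (s[n] != 0)
            ++n;
        return (cast(char*) s)[0 .. n];
    }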

A number of C types have been renamed in D:
http://www.digitalmars.com/d/htod.html
"Type Mappings"

Viewing C's char type as having been renamed to ubyte would make this just 
another of these mappings.
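
Side by side, the renamings mentioned in this thread, plus the proposed 
addition (the last line is the proposal, not current behaviour; the rest is 
as I read the htod table):

    // Existing C-to-D mappings (per the htod "Type Mappings" table):
    //     signed char     ->  byte
    //     unsigned char   ->  ubyte
    //     unsigned int    ->  uint
    //     long long       ->  long
    //
    // Proposed additional mapping:
    //     char            ->  ubyte   (currently char -> char)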

I'm now going to respond to Matti's latest comments on Bugzilla:

> It'd have to go beyond just string literals:
>
> string foo = "asdf";
> int i = strlen(foo.ptr);
>
> Bad example, I know (who needs strlen?), but the above should work
> without having to cast foo.ptr from char* (or invariant(char)* if
> that's what it is in 2.0) to ubyte*.  Passing through toStringz at
> every call may not be an option.

Why might it not be an option?  And what about having to pass it through 
toStringz anyway, for the very reason toStringz exists in the first place?
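
To spell out that reason: toStringz exists because D strings carry a length 
and are not guaranteed to be NUL-terminated.  A contrived sketch, using 
today's char-based signatures:

    import std.string;   // toStringz

    void main()
    {
        string s = "asdf"[0 .. 2];   // a slice: "as", with no NUL at its end
        // strlen(s.ptr) would run past the slice's end, since nothing
        // terminates it at index 2.  toStringz returns a pointer to a
        // NUL-terminated copy when one is needed:
        auto p = toStringz(s);
    }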

>> But is it really an inconsistency?  Really, all that's happened is
>> that C's signed char has been renamed as byte, and C's unsigned
>> char as ubyte.  It's no more inconsistent than unsigned int being
>> renamed uint, and long long being renamed long.
>
> No, not really, but Walter seems to think it important that code
> that looks like C should work like it does in C.  I agree with that
> sentiment to a point, and thus minimizing such inconsistencies is a
> good idea.  In this case, however, I'd rather have the
> inconsistency.

The "looks like C, acts like C" principle doesn't seem to be consistently 
applied - the switch default error and the renaming of long long to long are 
just two places where it breaks down.  But this is a case where the 
difference would be in whether it compiles or not, and so it's more a matter 
of D not being source-compatible with C, which is a good design decision 
indeed.

Further comments?

Stewart.

-- 
My e-mail address is valid but not my primary mailbox.  Please keep replies 
on the 'group where everybody may benefit. 



