C strings - byte, ubyte or char? (Discussion from Bugzilla)
Stewart Gordon
smjg_1998 at yahoo.com
Thu Oct 4 06:40:50 PDT 2007
This bug report
http://d.puremagic.com/issues/show_bug.cgi?id=1357
"Cannot use FFFF and FFFE in Unicode escape sequences."
has drifted into a discussion about which type should be used for D bindings
of C string-processing functions. I hereby propose that we continue the
discussion here.
I'll summarise what's been discussed so far here, and add a few points.
First it was proposed that the C string type should become ubyte*, rather
than char*, in D. (Actually, whether it should be byte or ubyte depends on
whether the underlying C implementation treats unqualified char as signed or
unsigned. But it probably doesn't matter to the implementations of most
string functions.)
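The signedness point can be seen from the C side. The following is only a sketch: the function names are made up for illustration, and it assumes 8-bit bytes and the usual two's-complement conversion.

```c
#include <limits.h>

/* Returns 1 if unqualified char is signed on this implementation, else 0.
   The C standard leaves this choice to the implementation. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Reads the first code unit through plain char: the result is negative
   for bytes >= 0x80 when char is signed (assumption: 8-bit char). */
int first_unit_as_char(const char *s)
{
    return s[0];
}

/* Reads the same byte through unsigned char: always in 0..255. */
int first_unit_as_ubyte(const char *s)
{
    return (unsigned char)s[0];
}
```

A function such as strlen only compares each unit against zero, so it gives the same answer either way; the sign matters only to code that compares or widens the values.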
C APIs use the char type and its derivatives for two main purposes:
- code units in an arbitrary 8-bit character encoding
- byte-sized integer values of arbitrary semantics
Both these uses distinguish it from D's char type, which is intended
specifically for holding UTF-8 code units. For both these uses, byte or
ubyte is more appropriate, and so it would make sense to have these types as
the standard in D for communicating such data with C APIs.
This would entail:
- changing the string functions in std.c.* to use ubyte where they use char
at the moment
- changing std.string.toStringz to return ubyte* instead of char*, and
having a corresponding char[] toString(ubyte*) function
- changing the language so that string literals can also serve as ubyte*, in
addition to the types that they already serve as.
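To make concrete why an arbitrary 8-bit code unit is not the same thing as a UTF-8 code unit, here is a small sketch in C. The names are hypothetical, and the lead-byte ranges are those of UTF-8 as defined in RFC 3629. The letter "é" is one byte in ISO-8859-1 but two in UTF-8, and the lone Latin-1 byte does not form a valid UTF-8 sequence.

```c
#include <stddef.h>

/* "é" in ISO-8859-1 (Latin-1): a single code unit, then the terminator. */
const unsigned char latin1_e_acute[] = { 0xE9, 0x00 };

/* "é" in UTF-8: two code units, then the terminator. */
const unsigned char utf8_e_acute[] = { 0xC3, 0xA9, 0x00 };

/* Length in bytes of the UTF-8 sequence that lead byte b announces,
   or 0 if b cannot start a sequence (continuation or invalid byte). */
size_t utf8_sequence_length(unsigned char b)
{
    if (b < 0x80)
        return 1;               /* ASCII */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;
    if (b >= 0xE0 && b <= 0xEF)
        return 3;
    if (b >= 0xF0 && b <= 0xF4)
        return 4;
    return 0;
}
```

Read as UTF-8, the Latin-1 byte 0xE9 announces a three-byte sequence that never arrives, which is the kind of data a C API may legitimately hand back in a char*.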
A number of C types have been renamed in D:
http://www.digitalmars.com/d/htod.html
"Type Mappings"
Treating C's char type as having been renamed ubyte would be just one more
of these.
I'm now going to respond to Matti's latest comments on Bugzilla:
> It'd have to go beyond just string literals:
>
> string foo = "asdf";
> int i = strlen(foo.ptr);
>
> Bad example, I know (who needs strlen?), but the above should work
> without having to cast foo.ptr from char* (or invariant(char)* if
> that's what it is in 2.0) to ubyte*. Passing through toStringz at
> every call may not be an option.
Why might it not be an option? And what about having to pass it through
toStringz anyway, for the very reason toStringz exists in the first place?
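For readers less familiar with the reason in question: a D array is a pointer plus a length and is not in general NUL-terminated (string literals are a special case), whereas C string functions rely on the terminator. The sketch below models that in C. The struct and function names are hypothetical, and this is not the actual implementation of toStringz, only an illustration of the job it does.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical C model of a D array slice: pointer plus length,
   with no terminator guaranteed after the last element. */
struct slice {
    const char *ptr;
    size_t len;
};

/* Returns a newly allocated NUL-terminated copy of the slice, or NULL
   if allocation fails. The caller is responsible for free(). */
char *slice_to_stringz(struct slice s)
{
    char *z = malloc(s.len + 1);
    if (z == NULL)
        return NULL;
    if (s.len > 0)
        memcpy(z, s.ptr, s.len);
    z[s.len] = '\0';            /* the terminator C functions look for */
    return z;
}
```

Calling strlen directly on the pointer of a slice taken from the middle of a longer buffer would read past the slice, which is why the conversion step cannot simply be dropped.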
>> But is it really an inconsistency? Really, all that's happened is
>> that C's signed char has been renamed as byte, and C's unsigned
>> char as ubyte. It's no more inconsistent than unsigned int being
>> renamed uint, and long long being renamed long.
>
> No, not really, but Walter seems to think it important that code
> that looks like C should work like it does in C. I agree with that
> sentiment to a point, and thus minimizing such inconsistencies is a
> good idea. In this case, however, I'd rather have the
> inconsistency.
The "looks like C, acts like C" principle doesn't seem to be applied
consistently: the switch default error and the renaming of long long to long
are just two places where it breaks down. But here the difference would be
in whether the code compiles at all, not in how it behaves once compiled. So
it's more a matter of D not being source-compatible with C, which is a good
design decision indeed.
Further comments?
Stewart.
--
My e-mail address is valid but not my primary mailbox. Please keep replies
on the 'group where everybody may benefit.