C strings - byte, ubyte or char? (Discussion from Bugzilla)

Thu Oct 4 07:26:24 PDT 2007

Stewart Gordon wrote:
> This bug report
> http://d.puremagic.com/issues/show_bug.cgi?id=1357
> "Cannot use FFFF and FFFE in Unicode escape sequences."
> 
> has drifted into a discussion about which type should be used for D
> bindings of C string-processing functions.  I hereby propose that we
> continue the discussion here.

Good idea. But note that I'm not talking only about C string-processing
functions: in general, any functions which process strings without regard to
their encoding should use ubytes.

Just about all of the functions in std.string are such, for instance. The Tango
situation is better, since tango.text.Util is already templated for
char/wchar/dchar: ubyte would need to be added to the mix.

<snip>
>> It'd have to go beyond just string literals:
>>
>> string foo = "asdf";
>> int i = strlen(foo.ptr);
>>
>> Bad example, I know (who needs strlen?), but the above should work
>> without having to cast foo.ptr from char* (or invariant(char)* if
>> that's what it is in 2.0) to ubyte*.  Passing through toStringz at
>> every call may not be an option.
> 
> Why might it not be an option?  And what about having to pass it through
> toStringz anyway, for the very reason toStringz exists in the first place?

One problem with toStringz is efficiency. Its current implementation performs a
string concatenation every time. If you know the string is zero-terminated and
ASCII (or you just want it to be handled as encoding-agnostic), you should just
be able to pass it through.
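
As a rough sketch of the two options, assuming present-day D module names
(core.stdc.string, std.string); the helper names lenViaToStringz and lenDirect
are made up for illustration:

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Always safe: toStringz hands back zero-terminated data,
// copying the string when it has to.
size_t lenViaToStringz(string s)
{
    return strlen(toStringz(s));
}

// No copy. Only valid if the caller knows a '\0' follows s in memory.
size_t lenDirect(string s)
{
    return strlen(s.ptr);
}

void main()
{
    string foo = "asdf"; // string literals are stored with a trailing '\0'

    assert(lenViaToStringz(foo) == 4);
    assert(lenDirect(foo) == 4);

    // A slice is not terminated where it ends, so only toStringz is safe here.
    assert(lenViaToStringz(foo[0 .. 2]) == 2);
}
```

The direct version is only correct because foo comes from a literal; for a
slice or a string built at run time there is no such guarantee.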

But on second thought, requiring the cast (or a call to toStringz) might be
better. If you want UTF-8 to be handled as encoding-agnostic, a mandatory cast
may be a good idea, as it signals that you know what you're doing.
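
Roughly what that would look like, assuming a hypothetical encoding-agnostic
function byteLength that takes ubyte pointers (it is not an existing library
function, just a stand-in for a ubyte-based binding):

```d
// Hypothetical encoding-agnostic routine: counts bytes up to the first zero.
size_t byteLength(const(ubyte)* s)
{
    size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}

void main()
{
    string foo = "asdf"; // from a literal, so a '\0' follows the data

    // char pointers do not convert implicitly to ubyte pointers.
    static assert(!__traits(compiles, byteLength(foo.ptr)));

    // The explicit cast states the intent: treat the UTF-8 as plain bytes.
    assert(byteLength(cast(const(ubyte)*) foo.ptr) == 4);
}
```

The cast costs nothing at run time; it only makes the reinterpretation visible
at the call site.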

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi