toStringz and toUTFz potentially unsafe

Sun Jul 24 17:45:23 PDT 2011

On 7/24/11 7:41 PM, Johann MacDonagh wrote:
> Both toStringz and toUTFz do something potentially unsafe. Both check
> whether the character after the end of the string is NULL. If so, they
> simply return a pointer to the original string. This is a good
> optimization in theory because this code:
>
> string s = "abc";
>
> will be a slice to a read-only section of the executable. The compiler
> will insert a NULL after the string in the read-only section. So this:
>
> auto x = toStringz("abc");
>
> is efficient. No reallocation or copy is needed.
>
> As @AndrejMitrovic commented in Phobos pull request 123
> https://github.com/D-Programming-Language/phobos/pull/123, this has
> potential issues:
>
> import std.string;
> import std.stdio;
>
> struct A
> {
>     immutable char[2] foo;
>     char[2] bar;
> }
>
> void main()
> {
>     auto a = A("aa", "\0b");
>     auto charptr = toStringz(a.foo[]);
>
>     a.bar = "bo";
>     printf(charptr); // two chars, then garbage
> }
>
> Another issue not mentioned is with slices. If I do...
>
> string s = "abc";
> string y = s[];
> string z = y[];
>
> z ~= '\0';
>
> auto c = toStringz(y);
>
> assert(c == y.ptr); // c is already a pointer, so it has no .ptr
>
> ... what happens if I change that last character of z before I pass c to
> the C routine? Bad news. I think this optimization is great, but doesn't
> it go against D's motto of doing "the right thing by default"?
>
> The question is, how can we keep this optimization so that:
>
> toStringz("abc");
>
> remains efficient?
>
> The capacity field is 0 if the string is in a read-only section *or* if
> the string is on the stack:
>
> auto x = "abc";
> assert(x.capacity == 0);
> char[3] y = "abc";
> assert(y.capacity == 0);
>
> So, this isn't safe either. This code:
>
> char[3] x = "abc";
> char y = '\0';
>
> will put y right after x, so changing y after calling toStringz will
> cause issues.
>
> In reality, the only time it's safe to do the "peek after end" is if the
> string is in the read-only section. Otherwise, there are potential
> issues (even if they are edge cases).
>
> Do we care about this? Is there something we can add to druntime arrays
> that will tell whether or not the data backing a slice is in read-only
> memory (or perhaps an enum: read-only, stack, heap, other)? In reality,
> the only time this changes is when a read-only or stack slice is appended
> to, so performance issues are minimal.
>
> Comments? Ideas?

I'm not too worried. I think it is fair to guarantee that the pointer
returned from toStringz points to a zero-terminated string only up until
the first relevant change. It is difficult to define what a relevant
change is, but in practice I think it is understood what's going on. If
there's a need for a persistent stringz, creating a private copy
immediately after the call to toStringz is always an option.

Andrei
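
P.S. (editorial sketch, not from the original message) The "private copy"
suggestion might look roughly like the following. persistentStringz is a
hypothetical helper name, not a Phobos function. The sketch assumes only
that toStringz returns a pointer with a zero at index s.length at the
moment of the call, and it copies the bytes before anything else can touch
the source.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

/// Hypothetical helper (not part of Phobos): returns a zero-terminated
/// copy that does not share memory with the source slice.
const(char)* persistentStringz(const(char)[] s)
{
    // toStringz may or may not reuse the original memory.
    auto p = toStringz(s);
    // Copy the characters plus the terminator right away; .dup allocates
    // a fresh GC array that is owned by this copy alone.
    return p[0 .. s.length + 1].dup.ptr;
}

void main()
{
    char[3] x = "abc";
    auto p = persistentStringz(x[]);

    x[] = "xyz"; // a later "relevant change" to the source

    // The private copy still holds the original text and terminator.
    assert(strlen(p) == 3);
    assert(p[0 .. 3] == "abc");
}
```

This always pays for one allocation per call, which is the cost the
peek-after-end optimization tries to avoid, so it seems best reserved for
pointers that must outlive later changes to the source.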