toStringz and toUTFz potentially unsafe

Mon Jul 25 06:27:34 PDT 2011

On Sun, 24 Jul 2011 20:41:47 -0400, Johann MacDonagh  
<johann.macdonagh.no at spam.gmail.com> wrote:

> Both toStringz and toUTFz do something potentially unsafe. Both check  
> whether the character after the end of the string is NULL. If so, then  
> it simply returns a pointer to the original string.

It is of critical importance to note that this *ONLY* happens for  
immutable data.  It is not done for const or mutable strings (instead a  
zero is always appended).

> Another issue not mentioned is with slices. If I do...
>
> string s = "abc";
> string y = s[];
> string z = y[];
>
> z ~= '\0';
>
> auto c = toStringz(y);
>
> assert(c.ptr == y.ptr);
>
> ... what happens if I change that last character of z before I pass c to  
> the C routine?

Not possible.  There are two important errors in your assumptions.  First,  
z is an immutable(char)[], so you cannot change the last character without  
a cast.  And if you do that, it's undefined behavior.

Second, z.ptr != y.ptr.  You cannot append to a ROM string, it will  
reallocate.

However, it's quite possible that *unallocated* data may change.  That is:

string s = "abc".dup;  // it's possible that the data after "abc" is a 0.   
The GC does not default initialize the unallocated portion of a data block  
if you are allocating a NO_SCAN block (for efficiency).

auto c = toStringz(s); // might just return s.ptr.

s ~= 'a'; // now if c == s.ptr, c is affected.

However, I still am not concerned about this (see my statement at the end  
of this message).

> Bad news. I think this optimization is great, but doesn't it go against  
> D's motto of doing "the right thing by default"?
>
> The question is, how can we keep this optimization so that:
>
> toStringz("abc");
>
> remains efficient?
>
> The capacity field is 0 if the string is in a read-only section *or* if  
> the string is on the stack:
>
> auto x = "abc";
> assert(x.capacity == 0);
> char[3] y = "abc";
> assert(x.capacity == 0);
>
> So, this isn't safe either. This code:
>
>      char[3] x = "abc";
>      char y = '\0';
>
> will put y right after x, so changing y after calling toStringz will  
> cause issues.

A not-so-small flaw in this argument is char[3] x is not immutable -- as I  
stated above, only an immutable string causes the optimization to occur.

It's highly unlikely anyone would ever write:

immutable char[3] y = "abc";

> In reality, the only time it's safe to do the "peek after end" is if the  
> string is in the read-only section. Otherwise, there are potential  
> issues (even if they are edge cases).
>
> Do we care about this? Is there something we can add to druntime arrays  
> that will tell whether or not the data backing a slice is in read-only  
> memory (or perhaps an enum: read-only, stack, heap, other)? In reality,  
> the only time this changes is when a read-only / stack heap is appended  
> to, so performance issues are minimal.

I think there could be a compromise solution:

If the slice is for immutable data (i.e. string), AND the slice is of a  
heap-allocated array (the runtime can determine this via a block lookup),  
AND the characer after the slice is part of the used block data, it would  
be safe to assume that data is also immutable.  There are remote  
possibilities that this could not be the case, but this involves  
improbable casting scenarios.

Note that the lookup cost is not trivial -- it's O(lgn), and it requires a  
GC lock.

This, of course, would have to be in addition to checking if the string is  
in ROM.  I don't know if this is possible...

BTW, I agree with Jonathan that this is a very rare corner-case (it might  
even be non-existent in real code).

I've only ever used toStringz or seen it used as an expression for a  
parameter -- I've never actually saved the result and did other stuff  
before passing it.

I think the warning he states in documentation is sufficient.

-Steve