toStringz and toUTFz potentially unsafe

Sun Jul 24 18:05:12 PDT 2011

On Sunday, July 24, 2011 5:56:04 PM, Jonathan M Davis wrote:
> On Sunday 24 July 2011 20:41:47 Johann MacDonagh wrote:
>> Both toStringz and toUTFz do something potentially unsafe. Both check
>> whether the character after the end of the string is NULL. If so, then
>> it simply returns a pointer to the original string. This is a good
>> optimization in theory because this code:
>>
>> string s = "abc";
>>
>> will be a slice to a read-only section of the executable. The compiler
>> will insert a NULL after the string in the read-only section. So this:
>>
>> auto x = toStringz("abc");
>>
>> is efficient. No relocations.
>>
>> As @AndrejMitrovic commented in Phobos pull request 123
>> https://github.com/D-Programming-Language/phobos/pull/123, this has
>> potential issues:
>>
>> import std.string;
>> import std.stdio;
>>
>> struct A
>> {
>>      immutable char[2] foo;
>>      char[2] bar;
>> }
>>
>> void main()
>> {
>>      auto a = A("aa", "\0b");
>>      auto charptr = toStringz
(a.foo[]);
>>
>>      a.bar = "bo";
>>      printf(charptr);   // two chars, then garbage
>> }
>>
>> Another issue not mentioned is with slices. If I do...
>>
>> string s = "abc";
>> string y = s[];
>> string z = y[];
>>
>> z ~= '\0';
>>
>> auto c = toStringz(y);
>>
>> assert(c.ptr == y.ptr);
>>
>> ... what happens if I change that last character of z before I pass c to
>> the C routine? Bad news. I think this optimization is great, but doesn't
>> it go against D's motto of doing "the right thing by default"?
>>
>> The question is, how can we keep this optimization so that:
>>
>> toStringz("abc");
>>
>> remains efficient?
>>
>> The capacity field is 0 if the string is in a read-only section *or* if
>> the string is on the stack:
>>
>> auto x = "abc";
>> assert(x.capacity == 0);
>> char[3] y = "abc";
>> assert(x.capacity == 0);
>>
>> So, this isn't safe either. This code:
>>
>>      char[3] x = "abc";
>>      char y = '\0';
>>
>> will put y right after x, so changing y after c
alling toStringz will
>> cause issues.
>>
>> In reality, the only time it's safe to do the "peek after end" is if the
>> string is in the read-only section. Otherwise, there are potential
>> issues (even if they are edge cases).
>>
>> Do we care about this? Is there something we can add to druntime arrays
>> that will tell whether or not the data backing a slice is in read-only
>> memory (or perhaps an enum: read-only, stack, heap, other)? In reality,
>> the only time this changes is when a read-only / stack heap is appended
>> to, so performance issues are minimal.
>>
>> Comments? Ideas?
> 
> The _only_ time that this matters is if you actually keep the result of 
> toStringz or toUTFz around. If you do what is almost always done - that is 
> pass the the result of toStringz/toUTFz directly to a function and not save it 
> at all - then there is _zero_ problem.
> 
> If you want to save the char* instead, then you can simply choose to append 
> '\0' instead of calling toStrin
gz/toUTFz. toStringz has pretty much 0 value if 
> it's forced to copy the string in all cases, and unless checking past the end 
> of the string _and_ guaranteeing that the character one past the end won't 
> change can be done and done efficiently, then the only way to guarantee that 
> then value one past the end won't change is to make a copy, in which case 
> toStringz is essentially pointless.
> 
> The documentation for toUTFz has an appropriate warning on it, so the issue 
> should be clear. If you want to avoid it, simply append '\0' instead of 
> calling toStringz or toUTFz. In 99.99% of cases, the char* is passed directly 
> to a C function anyway, in which case there is not problem.
> 
> So, I really don't think that this is really an issue. If we can find an 
> efficient way to make better guarantees, then that would be great, but the risk 
> at this point is _very_ minimal, and there's an easy way to get around it in 
> the _rare_ case where it could be a problem
 (simply append '\0' instead of 
> calling toStringz/toUTFz).
> 
> - Jonathan M Davis

Not entire true.  There's one more very important way this can cause a 
serious problem:  when the string in question is at the end of a block 
of memory and the +1 location is outside that.  The result will be a 
segv.

While I agree the chances of problem is low, it's certainly not 0 and 
is the sort of thing that if we saw another language do it we'd laugh 
and point.  This kind of code produces hard to debug problems and 
doesn't belong in the standard library.