toStringz and toUTFz potentially unsafe

Sun Jul 24 18:06:20 PDT 2011

On Sunday 24 July 2011 17:56:04 Jonathan M Davis wrote:
> On Sunday 24 July 2011 20:41:47 Johann MacDonagh wrote:
> > Both toStringz and toUTFz do something potentially unsafe. Both check
> > whether the character after the end of the string is NULL. If so, then
> > it simply returns a pointer to the original string. This is a good
> > optimization in theory because this code:
> > 
> > string s = "abc";
> > 
> > will be a slice to a read-only section of the executable. The compiler
> > will insert a NULL after the string in the read-only section. So this:
> > 
> > auto x = toStringz("abc");
> > 
> > is efficient. No relocations.
> > 
> > As @AndrejMitrovic commented in Phobos pull request 123
> > https://github.com/D-Programming-Language/phobos/pull/123, this has
> > potential issues:
> > 
> > import std.string;
> > import std.stdio;
> > 
> > struct A
> > {
> > 
> >      immutable char[2] foo;
> >      char[2] bar;
> > 
> > }
> > 
> > void main()
> > {
> > 
> >      auto a = A("aa", "\0b");
> >      auto charptr = toStringz(a.foo[]);
> >      
> >      a.bar = "bo";
> >      printf(charptr);   // two chars, then garbage
> > 
> > }
> > 
> > Another issue not mentioned is with slices. If I do...
> > 
> > string s = "abc";
> > string y = s[];
> > string z = y[];
> > 
> > z ~= '\0';
> > 
> > auto c = toStringz(y);
> > 
> > assert(c.ptr == y.ptr);
> > 
> > ... what happens if I change that last character of z before I pass c to
> > the C routine? Bad news. I think this optimization is great, but doesn't
> > it go against D's motto of doing "the right thing by default"?
> > 
> > The question is, how can we keep this optimization so that:
> > 
> > toStringz("abc");
> > 
> > remains efficient?
> > 
> > The capacity field is 0 if the string is in a read-only section *or* if
> > the string is on the stack:
> > 
> > auto x = "abc";
> > assert(x.capacity == 0);
> > char[3] y = "abc";
> > assert(x.capacity == 0);
> > 
> > So, this isn't safe either. This code:
> >      char[3] x = "abc";
> >      char y = '\0';
> > 
> > will put y right after x, so changing y after calling toStringz will
> > cause issues.
> > 
> > In reality, the only time it's safe to do the "peek after end" is if the
> > string is in the read-only section. Otherwise, there are potential
> > issues (even if they are edge cases).
> > 
> > Do we care about this? Is there something we can add to druntime arrays
> > that will tell whether or not the data backing a slice is in read-only
> > memory (or perhaps an enum: read-only, stack, heap, other)? In reality,
> > the only time this changes is when a read-only / stack heap is appended
> > to, so performance issues are minimal.
> > 
> > Comments? Ideas?
> 
> The _only_ time that this matters is if you actually keep the result of
> toStringz or toUTFz around. If you do what is almost always done - that is
> pass the the result of toStringz/toUTFz directly to a function and not save
> it at all - then there is _zero_ problem.
> 
> If you want to save the char* instead, then you can simply choose to append
> '\0' instead of calling toStringz/toUTFz. toStringz has pretty much 0 value
> if it's forced to copy the string in all cases, and unless checking past
> the end of the string _and_ guaranteeing that the character one past the
> end won't change can be done and done efficiently, then the only way to
> guarantee that then value one past the end won't change is to make a copy,
> in which case toStringz is essentially pointless.
> 
> The documentation for toUTFz has an appropriate warning on it, so the issue
> should be clear. If you want to avoid it, simply append '\0' instead of
> calling toStringz or toUTFz. In 99.99% of cases, the char* is passed
> directly to a C function anyway, in which case there is not problem.
> 
> So, I really don't think that this is really an issue. If we can find an
> efficient way to make better guarantees, then that would be great, but the
> risk at this point is _very_ minimal, and there's an easy way to get around
> it in the _rare_ case where it could be a problem (simply append '\0'
> instead of calling toStringz/toUTFz).

The real question is what to do with to!(char*)(str). The plan is to make it 
call toUTFz, but at that point, the warning about toUTFz is not as obvious 
(though it can re-iterate the warning or point you to the toUTFz documentation 
to read it). Also, since you already have toUTFz, calling to!(char*) is kind 
of pointless. So, I think that there's a good argument for forcing to!(char*) 
to append '\0' instead of checking one past the end. Then when you want a 
guarantee that the '\0' isn't going to change, you can use to!(char*), and if 
you want the efficiency, you can call toUTFz. But it is debatable whether we 
should do that or just have to!(char*) call toUTFz in all cases. I'm leaning 
towards making it always copy though.

- Jonathan M Davis