The great slice debate -- should slices be separated from arrays?
Steven Schveighoffer
schveiguy at yahoo.com
Tue Nov 24 06:27:22 PST 2009
In many other posts, people have been fretting over dropping T[new] and
not having a reference array type. The argument is (and forgive me if
this is a strawman, someone from the other side can post if I'm
incorrect): If we make arrays a separate type from slices, and only allow
appending on arrays, then we solve the stomping problem and the
hard-to-determine reallocating problem. For those who are unfamiliar with
these problems, I'll try to outline them at the bottom of the post.
I contend that even if we make arrays a separate type, even if arrays are
made a true reference type, slices will still suffer from the same
hard-to-determine reallocation problem *unless* we make slices fatter.
My proof is as simple as an example. Assume 'Array' is the new type for
an array:
auto a = new Array!(int)(15); // allocate an array of 15 integers, all 0
auto s = a[0..5];
a ~= [1,2,3,4,5];
a[0] = 1;
Now, what does s[0] equal?
It depends on how s is defined. If s is a simple pointer + length slice
as defined today, then it is *STILL* hard-to-determine because you are
unsure whether a had to reallocate to a new block! If you make s a "fat"
slice, that is, it contains a reference to the *original* array plus two
indices, then you have issues:
1. you are now passing around 12 (24 for 64-bit) bytes for a slice instead
of 8 (16), which hurts performance
2. you are now subject to a slice becoming invalid if a decides to
reallocate into a smaller block and your slice indexes data that is now
outside the block.
3. you are now subject to corruption if the array decides to truncate from
the *front* of its block, since the slice will now refer to different
data.
Because of points 2 and 3, you lose static provability of slices being
valid, since their backing array can reallocate at any time.
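To make points 2 and 3 concrete, here is a minimal sketch of what a fat
slice might look like (FatSlice, lo, and hi are names I invented for
illustration; Array is the hypothetical reference array type from the
example above):

struct FatSlice(T)
{
    Array!(T) backing; // reference to the original array
    size_t lo;         // start index into backing
    size_t hi;         // one-past-the-end index into backing

    ref T opIndex(size_t i)
    {
        // every access re-resolves through backing, so if backing
        // shrinks (point 2) or truncates from the front (point 3),
        // lo and hi silently denote invalid or different data
        return backing[lo + i];
    }
}

Note that every access also pays an extra indirection, on top of the
size cost in point 1.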
Are there any alternative implementations for slices that work better? If
not, I think thin slices (pointer and length) are the way to go, which
leaves us with a still hard-to-determine reallocation strategy. In this
case, why not leave the arrays the way they are?
The answer is, stomping. But we have already found ways to fix that
without changing any code. In the instance where you need the fastest
appending possible, we can create a library type to handle that, then you
convert to an actual array when you need it. I think dsimcha is already
working on such a type.
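As a sketch of what I mean (my own illustration, not dsimcha's actual
code), such a type appends only to memory it owns outright, so it can
grow in place with no stomping risk:

struct ArrayBuilder(T)
{
    private T[] buf;    // storage owned exclusively by the builder
    private size_t len; // number of elements in use

    void opCatAssign(T item) // overloads ~=
    {
        if (len == buf.length)
            buf.length = buf.length ? buf.length * 2 : 4; // geometric growth
        buf[len++] = item;
    }

    T[] data() { return buf[0 .. len]; } // convert to an ordinary array
}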
So is there anyone in the "please separate arrays from slices" camp that
can counter this? Are there other solutions someone can think of or has
already brought up that I didn't mention?
======================
Definition of the "hard-to-determine" reallocating problem: (I'd classify
this as non-deterministic reallocating, but it depends on your point of
view, and Walter has fits over that term :)
If you grow an array (e.g. by appending), it may or may not copy the
data to another memory block, depending on whether it can resize into
the existing memory block. If it can, then the original memory is still
referenced by the array. The determination of when this occurs depends on the
implementation of the runtime. In the current incarnation of dmd, this
occurs when the array is sized beyond powers of 2 up to a page size, and
then at page size increments.
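For example (my own illustration of the current behavior; the exact
thresholds are runtime-dependent):

char[] a = new char[10];
char[] b = a;        // b aliases the same memory block
a ~= 'x';            // may fit in the block: a and b still share memory
a ~= new char[5000]; // too big to fit: a moves to a new, larger block
// whether b still sees a's data now depends on the runtime's decision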
The trouble is, when you see code like this:
void foo(char[] buf)
{
    buf ~= "abcd";
    buf[0] = 'z';
}
You cannot tell whether the input array was affected because the append
statement may or may not change the address of buf. If it doesn't, then
the argument passed in is affected. If it does, the argument passed in is
not affected. The current solutions to this problem are either to
document the behavior or to dup the incoming buffer before appending (or
to use the buf = buf ~ "abcd" form, which guarantees reallocation).
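For example, the last form makes foo safe at the cost of an
unconditional copy:

void foo(char[] buf)
{
    buf = buf ~ "abcd"; // always allocates a new block
    buf[0] = 'z';       // writes only to the fresh copy, never the argument
}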
======================
Definition of the stomping problem:
If you append to a slice which *starts* at the beginning of a
heap-allocated memory block, and the append operation fits within the
memory block, the runtime grows the slice in place rather than copying.
However, if that slice was a smaller piece of a larger array, the append
overwrites the array's data. The optimization came into effect when
people noticed code like the following was very slow and ate up too much
memory:
int[] x;
for(int i = 0; i < 10000; i++)
    x ~= i;
If x reallocates on every iteration, it will allocate 10000 times, each time
allocating a larger memory block. This forces more GC collection cycles
and can leave behind scores of unused pages of memory if the collector
doesn't reuse or allow the OS to reclaim them.
The optimization depends on the likelihood of a slice that points to the
beginning of a block being the owner of the memory block (that is, all
other slices were created from it, and therefore do not go beyond the
extents of the primary slice). This is certainly the case after the first
reallocation that occurs in the loop above, making this loop perfectly
safe. However, it is not hard to come up with an unsafe case:
int[] x = [1,2,3];
auto y = x[0..1]; // slice only the first element
y ~= 5; // now x == [1,5,3]
This is even more disturbing for const or immutable data:
string x = "hello".idup; // need the idup to put the data on the heap.
auto y = x[0..1]; // slice only "h"
y ~= "owdy"; // now x, which is supposed to be immutable, changed to "howdy"
This stomping becomes more of a problem for library writers:
char * toStringZ(string s)
{
    s ~= '\0';
    return s.ptr;
}
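For example (a worked trace of the naive version above, assuming the
in-place append behavior described earlier):

string greeting = "hello world".idup;
string hello = greeting[0 .. 5]; // slice at the *start* of the block
auto p = toStringZ(hello);       // appends '\0' in place
// the '\0' lands where greeting[5] (the space) lives, so greeting's
// supposedly immutable data now reads "hello\0world"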
This seemingly innocuous function could easily corrupt immutable data if s
is a slice of the beginning of a larger array. Therefore, toStringZ has
to be defensive:
char * toStringZ(string s)
{
    s = s ~ '\0'; // not using ~= operator forces a reallocation
    return s.ptr;
}
But this "defensive" style would be unnecessary if stomping could not
occur.