The great slice debate -- should slices be separated from arrays?
Steven Schveighoffer
schveiguy at yahoo.com
Tue Nov 24 06:27:22 PST 2009
In many other posts, people have been fretting over dropping T[new] and
not having a reference array type. The argument is (and forgive me if
this is a strawman, someone from the other side can post if I'm
incorrect): If we make arrays a separate type from slices, and only allow
appending on arrays, then we solve the stomping problem and the
hard-to-determine reallocating problem. For those who are unfamiliar with
these problems, I'll try to outline them at the bottom of the post.
I contend that even if we make arrays a separate type, even if arrays are
made a true reference type, slices will still suffer from the same
hard-to-determine reallocation problem *unless* we make slices fatter.
My proof is as simple as an example. Assume 'Array' is the new type for
an array:
auto a = new Array!(int)(15); // allocate an array of 15 integers, all 0
auto s = a[0..5];
a ~= [1,2,3,4,5];
a[0] = 1;
Now, what does s[0] equal?
It depends on how s is defined. If s is a simple pointer + length slice
as defined today, then it is *STILL* hard-to-determine because you are
unsure whether a had to reallocate to a new block! If you make s a "fat"
slice, that is, it contains a reference to the *original* array plus two
indices, then you have issues:
1. you are now passing around 12 (24 for 64-bit) bytes for a slice instead
of 8 (16), which hurts performance
2. you are now subject to a slice becoming invalid if a decides to
reallocate into a smaller block and your slice indexes data that is now
outside the block.
3. you are now subject to corruption if the array decides to truncate from
the *front* of its block, since the slice will now refer to different
data.
Because of points 2 and 3, you lose static provability of slices being
valid, since their backing array can reallocate at any time.
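To make points 2 and 3 concrete, here is a minimal sketch of what a fat
slice might look like (FatSlice, lo, and hi are names I invented for
illustration; Array is the hypothetical reference array type from the
example above):

struct FatSlice(T)
{
    Array!(T) backing; // reference to the original array
    size_t lo;         // start index into backing
    size_t hi;         // one-past-the-end index into backing

    ref T opIndex(size_t i)
    {
        // every access re-resolves through backing, so if backing
        // shrinks (point 2) or truncates from the front (point 3),
        // lo and hi silently denote invalid or different data
        return backing[lo + i];
    }
}

Note that every access also pays an extra indirection, on top of the
size cost in point 1.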
Are there any alternative implementations for slices that work better? If
not, I think thin slices (pointer and length) are the way to go, which
leaves us with a still hard-to-determine reallocation strategy. In this
case, why not leave the arrays the way they are?
The answer is, stomping. But we have already found ways to fix that
without changing any code. In the instance where you need the fastest
appending possible, we can create a library type to handle that, then you
convert to an actual array when you need it. I think dsimcha is already
working on such a type.
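As a sketch of what I mean (my own illustration, not dsimcha's actual
code), such a type appends only to memory it owns outright, so it can
grow in place with no stomping risk:

struct ArrayBuilder(T)
{
    private T[] buf;    // storage owned exclusively by the builder
    private size_t len; // number of elements in use

    void opCatAssign(T item) // overloads ~=
    {
        if (len == buf.length)
            buf.length = buf.length ? buf.length * 2 : 4; // geometric growth
        buf[len++] = item;
    }

    T[] data() { return buf[0 .. len]; } // convert to an ordinary array
}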
So is there anyone in the "please separate arrays from slices" camp that
can counter this? Are there other solutions someone can think of or has
already brought up that I didn't mention?
======================
Definition of the "hard-to-determine" reallocating problem: (I'd classify
this as non-deterministic reallocating, but it depends on your point of
view, and Walter has fits over that term :)
If you grow an array (e.g. by appending), it may or may not copy the
data to another memory block, depending on whether it can resize into
the existing memory block. If it can, then the original memory is still
referenced by the array. The determination of when this occurs depends on the
implementation of the runtime. In the current incarnation of dmd, this
occurs when the array is sized beyond powers of 2 up to a page size, and
then at page size increments.
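For example (my own illustration of the current behavior; the exact
thresholds are runtime-dependent):

char[] a = new char[10];
char[] b = a;        // b aliases the same memory block
a ~= 'x';            // may fit in the block: a and b still share memory
a ~= new char[5000]; // too big to fit: a moves to a new, larger block
// whether b still sees a's data now depends on the runtime's decision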
The trouble is, when you see code like this:
void foo(char[] buf)
{
    buf ~= "abcd";
    buf[0] = 'z';
}
You cannot tell whether the input array was affected because the append
statement may or may not change the address of buf. If it doesn't, then
the argument passed in is affected. If it does, the argument passed in is
not affected. The current solutions to this problem are either to
document the behavior or to dup the incoming buffer before appending (or
to use the buf = buf ~ "abcd" form, which guarantees reallocation).
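For example, the last form makes foo safe at the cost of an
unconditional copy:

void foo(char[] buf)
{
    buf = buf ~ "abcd"; // always allocates a new block
    buf[0] = 'z';       // writes only to the fresh copy, never the argument
}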
======================
Definition of the stomping problem:
If you append to a slice which *starts* at the beginning of a
heap-allocated memory block, and the append operation fits within the
memory block, the runtime grows the slice in place rather than copying.
However, if that slice was a smaller piece of a larger array, the append
overwrites the array's data. The optimization came into effect when
people noticed code like the following was very slow and ate up too much
memory:
int[] x;
for(int i = 0; i < 10000; i++)
    x ~= i;
If x reallocates on every iteration, it will allocate 10000 times, each time
allocating a larger memory block. This forces more GC collection cycles
and can leave behind scores of unused pages of memory if the collector
doesn't reuse or allow the OS to reclaim them.
The optimization depends on the likelihood of a slice that points to the
beginning of a block being the owner of the memory block (that is, all
other slices were created from it, and therefore do not go beyond the
extents of the primary slice). This is certainly the case after the first
reallocation that occurs in the loop above, making this loop perfectly
safe. However, it is not hard to come up with an unsafe case:
int[] x = [1,2,3];
auto y = x[0..1]; // slice only the first element
y ~= 5; // now x == [1,5,3]
This is even more disturbing for const or immutable data:
string x = "hello".idup; // need the idup to put the data on the heap.
auto y = x[0..1]; // slice only "h"
y ~= "owdy"; // now x, which is supposed to be immutable, changed to "howdy"
This stomping becomes more of a problem for library writers:
char * toStringZ(string s)
{
    s ~= '\0';
    return s.ptr;
}
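For example (a worked trace of the naive version above, assuming the
in-place append behavior described earlier):

string greeting = "hello world".idup;
string hello = greeting[0 .. 5]; // slice at the *start* of the block
auto p = toStringZ(hello);       // appends '\0' in place
// the '\0' lands where greeting[5] (the space) lives, so greeting's
// supposedly immutable data now reads "hello\0world"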
This seemingly innocuous function could easily corrupt immutable data if s
is a slice of the beginning of a larger array. Therefore, toStringZ has
to be defensive:
char * toStringZ(string s)
{
    s = s ~ '\0'; // not using ~= operator forces a reallocation
    return s.ptr;
}
But this "defensive" style would be unnecessary if stomping could not
occur.