D array expansion and non-deterministic re-allocation

Tue Nov 24 04:47:57 PST 2009

On Mon, 23 Nov 2009 18:34:48 -0500, Leandro Lucarella <llucax at gmail.com>  
wrote:

> Steven Schveighoffer, el 23 de noviembre a las 15:18 me escribiste:
>> On Mon, 23 Nov 2009 11:10:48 -0500, Leandro Lucarella <llucax at gmail.com>
>> wrote:
>> >
>> >The thing is, with realloc() is less likely that you forget that the  
>> data
>> >can be copied because it returns the new pointer (that can be the same  
>> as
>> >the original pointer). And even in this case, I saw a lot of bugs  
>> related
>> >to realloc() misuse (and I made a couple myself).
>> >
>> >With slices is much worse.
>>
>> realloc is worse.  If I have multiple aliases to the same data, then
>> I realloc one of those, the others all can become dangling pointers
>> if the runtime decides to move the data.
>
> Well, you are comparing GC vs no-GC, not realloc() vs. slices. I have no
> intention on following that discussion, I'm just saying that realloc() is
> less error-prone than slices (and realloc() is error-prone already).

I was originally saying that realloc is similar to appending because it  
may or may not choose to move the data.
The only difference from realloc in slices is that your other aliases to  
the old data don't become invalidated, so there are no dangling pointers.   
The GC is partly the cause for that, but also the fact that slices don't  
deallocate the original data (that would be possible, even in a GC'd  
environment).

Your assertion that realloc is less error prone doesn't make any sense to  
me.  realloc is as error prone as appending slices *and* it can also leave  
dangling pointers.

>> You also cannot realloc data that's not malloc'd but you can append to
>> a slice of non-heap data without issue.
>
> How is that? AFAIK slices uses gc_realloc() to do the actual realloc, if
> that's done in a piece of malloc'ed data it will fail.

I believe step 1 of the append process is to find the GC heap block that  
contains the slice.  If it doesn't, it simply copies the data to a new  
block.  If the original data is stack-allocated or allocated outside the  
GC, no errors occur AFAIK.

> And even if it were
> true, I don't really see this as a big source of bugs, I really never had
> a bug because I tried to realloc() a non-heap piece of memory or  
> appending
> to a slice of non-GC-heap memory either.

You probably didn't have such bugs because you probably didn't think of  
doing this in C:

void foo(char *arg)
{
    realloc(arg, 10000);
}

Oops, what if arg isn't malloc'd?  But in D this is safe code, and works  
as you expect, no matter what size arg was to begin with.  In fact, tango  
uses this all the time for buffer optimization.  Basically, you pass in a  
stack buffer, if it's big enough, you save the penalty of heap  
allocation.  If it's not, no big deal, just resize the buffer and the GC  
takes care of the rest.

void foo(char[] arg)
{
    arg.length = 10000;
}

>
>> No matter what you do with slice appending in D, you cannot access
>> dangling pointers unless you access the slice's ptr field.
>
> Again, that's only because D is GCed, not because slices are great.

We are comparing realloc to slices.  You assert slices are worse and I  
gave you a reason why they are not.  You can't just ignore that because  
comparing slices to realloc requires taking the GC into account.

>
>> Yes, you can run into trouble if you append to a slice and then
>> change the original data in the slice, but that is a rare event, and
>> it won't result in corrupting memory you didn't already have access
>> to (at least, with the MRU cache).
>
> I'm a little lost now. I don't know of what hypothetical D are you  
> talking
> about.

Here's the deal.  You have two very common operations on an array:  
appending and modification.  In practice you either append data or you  
modify data.  Rarely do you append data *and* modify the original data,  
and the only case I've seen for that is for the buffer optimization trick  
above.  This is the only case where slices might get you  
"non-deterministic" behavior (and that's only if you still want to use the  
original array before appending).  I have no proof of this except for my  
experience reading tango code.

The two major use cases for appending are, building an array of items  
using append, and using an array as a buffer.  In the first case, you  
start out with an empty or fully owned array, so no harm.  In the second  
case, there are actually two types of functions that accept buffers, ones  
that return a slice of the buffer, and ones which don't.  The ones that  
return a slice, you use the slice, not the original buffer.  The ones  
which don't return a slice, you use the original buffer if you expect  
important data to be there.

I don't think there's ever a case where you see a function that takes an  
array append to the array, then modify the original part of the array  
*without* returning the result.

> I can't see how the MRU cache can provide any safety. The cache is
> finite, and not all the slices will fit in it, so for those slices that
> are not cached, I don't see how the cache can provide any safety.

In those cases, the slice is reallocated because the runtime isn't sure  
whether it will result in stomping.  The MRU cache is an optimization  
which defaults to reallocation for safety.

Andrei has mentioned that he thinks we can store the allocated length in  
the GC block, which I think would also work.  You also wouldn't need an  
MRU cache in that case, but he says it's in *addition* to the MRU cache,  
so I'm not sure what he means.

> Safety can be provided if a ~= b is defined to be semantically the same  
> as
> a = a ~ b, and leaving the MRU cache as an optimization.

Yes, exactly.  That is what the MRU cache does.  It's no good if the  
current behavior becomes the fallback, the fallback must be a reallocation.

> In that case we
> agree slices are predictable and a little safer (because stomping is not
> possible). But they are still error prone if you expect them to be a full
> reference type. Being the only entity in D with such semantics, is
> something one can forget very easily and introduce subtle bugs. In this
> case, I really think providing ~= is a bad idea, it's just too error  
> prone
> and doesn't give you anything.

Providing the simple syntax for simple cases is what this does.  I think  
it's worth having in the language.

Let's all face it: slices and arrays being the same thing is a new concept  
that most of us have only seen in D (I don't know many languages, but  
probably it's been done before).  And the hybrid value/reference nature of  
a slice is definitely new to me.  I think it provides great power and  
flexibility that makes code just work the way you want it to *in most  
cases*.  If you read a lot of the tango code, you see how elegant things  
can be with slices.  But it's always going to be confusing to newcomers.   
It's like explaining pointers for the first time to someone.

>
> I still think the best is to just make slices immutable value types (you
> can mutate the data they point to, you just can't modify slices; ptr and
> length), and provide a proper dynamic array type in the library or
> something.
>

I think you mean slices should only be *shrinkable*, immutable makes no  
sense, you need to be able to rebind a slice.

I don't see the harm in allowing appending as long as it's safe.

Having a separate type in the library that optimizes appending and has  
complete reference semantics is fine with me, just leave slices and arrays  
alone (well, except the fix for stomping).  They *can* live together in  
the same language.

-Steve