Enhanced array appending
Steven Schveighoffer
schveiguy at yahoo.com
Thu Dec 24 11:58:38 PST 2009
On Thu, 24 Dec 2009 13:39:21 -0500, grauzone <none at example.net> wrote:
> Sorry about that, I didn't have a close look at the patch. I guess I was
> more thinking about Andrei's original proposal (and how I thought it may
> be implemented).
No problem.
>
> It seems you store the length field inside the array's memory block
> (instead of the cache, which just speeds up gc_query). That's awesome,
> but I'm going to complain again: now you have to keep a length field for
> *all* memory allocations, not only arrays! For most object allocations,
> this means 4 more bytes additional overhead.
Interestingly enough, the storage overhead is zero except for memory
blocks > 256 bytes. I'll explain:
A probably not-well-known piece of trivia is that D allocates 1 extra byte
for arrays. Why would it do this, you ask? Because of the GC.
Imagine that it did not. What happens when you do something like
this:
ubyte[] array = new ubyte[16]; // allocates a 16-byte block for the array
array = array[$..$];
If you look at array's ptr member, it no longer points into the
allocated block, but at the next block. Although the GC no longer needs
to keep the original block around, this pointer will now keep the next
block from being collected. In addition, if you tried appending to array,
it might start using unallocated memory!
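To make the hazard concrete, here is a small sketch of that scenario (where
exactly the GC places blocks is up to the implementation, of course):

```d
import std.stdio;

void main()
{
    ubyte[] array = new ubyte[16]; // suppose this exactly fills a 16-byte block
    auto start = array.ptr;

    array = array[$ .. $]; // an empty slice at the very end

    // Without the extra pad byte, array.ptr would now point one byte past the
    // allocated block -- i.e., into the *next* block.  The GC, seeing this
    // pointer, would keep that unrelated block alive, and an append through
    // `array` could stomp on memory the slice doesn't own.
    assert(array.length == 0);
    assert(array.ptr == start + 16);
    writeln("empty slice points ", array.ptr - start, " bytes past the start");
}
```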
So the byte at the end was already in use; I sort of commandeered it for
length storage. For blocks up to and including 256 bytes, I use the last
byte of the block as the length storage. For blocks of 512 bytes up to a
half page (2048 bytes), I use the last 2 bytes, so there is one byte of
extra overhead compared to the current implementation.
Blocks larger than that follow different rules: they are not stored in
bins, but allocated a whole page at a time. With those blocks, it is
possible to extend the block by adding more pages if they are free, so
it's not OK to store the length at the end of the block, since the end of
the block may change. So I store it at the beginning. There I use up a
full 8 bytes, and the reason is alignment: I don't know what you're
putting in the array, so I must keep the data 8-byte aligned.
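Put together, the layout described above can be sketched roughly like this
(the function name and constants are just my reading of the scheme, not the
actual runtime code):

```d
// Illustrative sketch only -- not the real druntime implementation.
enum PAGESIZE = 4096;

// Given the size of a GC block, where does the hidden "used length" live,
// and how many bytes does it occupy?
void lengthLocation(size_t blockSize, out size_t offset, out size_t bytes)
{
    if (blockSize <= 256)
    {
        // small bins: reuse the pre-existing pad byte at the very end
        offset = blockSize - 1;
        bytes = 1;
    }
    else if (blockSize <= PAGESIZE / 2)
    {
        // 512 bytes to a half page: the last 2 bytes (one extra byte of
        // overhead compared to before)
        offset = blockSize - 2;
        bytes = 2;
    }
    else
    {
        // page-sized and larger: stored at the *front*, padded to 8 bytes so
        // the array data that follows stays 8-byte aligned (these blocks can
        // grow by appending pages, so the end is not a stable location)
        offset = 0;
        bytes = 8;
    }
}
```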
For classes, I allocate the required extra data as if the class were an
array of the class data size, and then set the "ghost" length at the end
of the block to 0. If a class's data exceeds a half page, the ghost length
is at the beginning, where the vtable pointer is, so it's extremely
unlikely to accidentally match that length. Note that this makes little
difference in most cases: the block used for the class is a power of 2
anyway, so there's usually plenty of wasted space at the end.
I found out during testing that allocating a new struct is equivalent to
allocating a new array of that struct with size 1 and returning its
pointer, so that aspect is already covered.
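In other words, as I understand it, these two allocations land in the same
kind of hidden-length block (a sketch of the equivalence, not the actual
runtime code):

```d
struct S { int x; double y; }

void main()
{
    S* p = new S;            // allocating a single struct...
    S* q = (new S[1]).ptr;   // ...behaves like a one-element array
                             // allocation, so the hidden length at the
                             // end of the block is already maintained
    assert(p !is null && q !is null);
    p.x = 1;                 // both are valid, GC-tracked pointers
    q.x = 2;
    assert(p.x == 1 && q.x == 2);
}
```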
> Also, if you use GC.malloc directly, and the user tries to append to
> slices to it, your code may break. GC.malloc doesn't seem to pad the
> memory block with a length field.
Yes, this is a possible problem. However, using GC.malloc directly and
then treating the result as a normal array is probably extremely rare; at
the least, it is not valid in safe D. It probably should be called out as
a danger in GC.malloc's documentation, but I don't think it will adversely
affect most code. There will be functions that should obviate the need for
calling GC.malloc to allocate an array (specifically, allocating
uninitialized data).
> I must say that I find your argumentation strange: didn't you say adding
> an additional field to the slice struct is too much overhead?
Overhead when passing around, not overhead for storage. For example,
pushing 12 bytes onto the stack instead of 8 when calling a function with
a slice. If you want safe appends, you need to store the allocation
length somewhere, there's no way around that.
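For concreteness, the struct names below are hypothetical, but they show the
size difference that would be paid at every call site:

```d
// What a D slice looks like when passed to a function: two machine words.
struct SliceAsPassed
{
    size_t length;
    void*  ptr;
}

// A hypothetical slice that also carried a pointer to the allocation length
// would add a third word -- 12 bytes instead of 8 on a 32-bit target, paid
// on every call, copy, and store, whether or not you ever append.
struct SliceWithAllocPtr
{
    size_t  length;
    void*   ptr;
    size_t* allocLen;
}

static assert(SliceAsPassed.sizeof == 2 * size_t.sizeof);
static assert(SliceWithAllocPtr.sizeof == 3 * size_t.sizeof);
```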
> Also, a solution which keeps the pointer to the array length in the
> slice struct would still be faster. The cache lookup cost is not zero.
This is the trade-off I chose between performance when passing slices
around and using them, and performance for appending. If you focus all
your performance concerns on appending, then other areas suffer. I don't
know if what I chose is the best solution, but I can't think of any way to
have *both* speedy slice usage and near-zero append overhead. If someone
can think of a better solution, I'll be happy to incorporate it. But after
using and understanding slices while developing stuff for Tango (Tango
uses slices to get every ounce of performance!), I'm convinced that
as-fast-as-possible slice semantics for passing data around is essential.
This is where a custom type that focuses performance on appending at the
expense of other functions is good to have. I think such applications
where you need the absolute best append performance without any other
functionality are pretty rare.
> Anyway, now that semantics and performance are somewhat sane, none of
> these remaining issues are too important. Thanks Steve and Andrei!
Cool!
-Steve
More information about the Digitalmars-d
mailing list