In-place extension of arrays only for certain alignment?
schveiguy at gmail.com
Thu Aug 18 01:31:16 UTC 2022
On 8/17/22 2:40 PM, Ali Çehreli wrote:
> On 8/16/22 19:33, Steven Schveighoffer wrote:
> Using a 16-byte block sounds like a good strategy at first because
> nobody knows whether an array will get more than one element.
> However, if my guess is correct (i.e. the first element of size of
> 16-bytes is placed on a 16-byte block), then the next allocation will
> always allocate memory for the second element.
A 16-byte element must be put in a 32-byte block, since you still need
one byte for the metadata.
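A quick way to observe this (a hedged sketch; the exact capacity value depends on the druntime version and its GC block classes):

```d
// Hedged sketch: observing block sizing for a 16-byte element type.
struct S { long a, b; } // 16 bytes

void main()
{
    import std.stdio : writeln;
    S[] arr;
    arr ~= S(1, 2);
    // If the element went into a 32-byte block with one metadata byte,
    // only one element fits, so the capacity here would be 1.
    writeln(arr.capacity);
}
```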
> One might argue that dynamic arrays are likely to have more than a
> single element, so the initial block should at least be twice the
> element size. This would cut memory allocation by 1 count for all
> arrays. And in my case of 1-element arrays, allocation count would be
> halved. (Because I pay for every single append right now.)
So yes, if you have a 32-byte block for 16-byte elements, it means you
can only fit one element in the block. If you are using a sliding window
approach, where you remove the first element and then append another,
you will in effect reallocate on every append.
Using the append/popFront mechanism to implement your sliding window is
going to perform badly. Appending is not designed to make this situation
efficient.
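To make the pattern concrete, here is a hedged sketch of the sliding-window-of-1 case under discussion: dropping the front and appending with `~=` tends to reallocate every time, because the (already full) block has no free room past the end of the slice.

```d
// Hedged sketch: a window of one 16-byte element, slid forward by
// popping the front and appending. With a 32-byte block holding only
// one element, each append typically allocates a fresh block.
struct S { long a, b; } // 16-byte element

void main()
{
    S[] window;
    window ~= S(0, 0);
    foreach (long i; 1 .. 5)
    {
        window = window[1 .. $]; // drop the old front element
        window ~= S(i, i);       // typically allocates a fresh block
    }
    assert(window.length == 1);
}
```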
> That all makes sense. I didn't think the meta data would be at the end
> but I sense it's related to the "end slice", so it's a better place
> there. (?)
It's for alignment. If I put 1 byte at the front, this means I have to
always skip 7 or 15 more bytes (depending on alignment requirements).
BUT, I put the metadata at the front on big (page+ size) blocks, because
I can both afford to skip 16 bytes in a block of 4096, and if you extend
the block, there is no need to move the metadata later. Consider that
the metadata lookup cache could be out of date if it had to move.
> > What is your focus? Why do you really want this "optimization" of gluing
> > together items to happen?
> This is about what you and I talked about in the past and something I
> mentioned in my DConf lightning talk this year. I am imagining a
> FIFO-like cache where elements of a source range are stored. There is a
> sliding window involved. I want to drop the unused front elements
> because they are expensive in 2 ways:
> 1) If the elements are not needed anymore, I have to move my slice
> forward so that the GC collects their pages.
> 2) If they are not needed anymore, I don't want to even copy them to a
> new block because this would be expensive, and in the case of an
> infinite source range, impossible.
Ah! I actually have a solution for this in iopipe -- a ring buffer.
Basically, you map the same physical pages of memory sequentially. It
allows you to simply change the pointer, and never need to copy anything.
See this code for an example (I only have it for Posix; Windows has
similar features that I still have to add):
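For illustration, here is a hedged sketch of the technique (not iopipe's actual code): on Posix, the same physical pages are mapped twice, back to back, so the window slides forward forever by just advancing a pointer; a read past the end of the first mapping lands in the second, which aliases the same memory. The name `ringMap` is mine, not iopipe's.

```d
// Hedged sketch of a virtual-memory ring buffer on Posix.
import core.sys.posix.sys.mman;
import core.sys.posix.unistd : ftruncate, close;
import core.sys.posix.stdlib : mkstemp;
import core.stdc.stdio : remove;

ubyte* ringMap(size_t len) // len must be a multiple of the page size
{
    // Anonymous backing file (memfd_create would also work on Linux).
    char[] tmpl = "/tmp/ringXXXXXX\0".dup;
    int fd = mkstemp(tmpl.ptr);
    remove(tmpl.ptr);
    ftruncate(fd, len);

    // Reserve 2*len of address space, then map the file into both halves.
    auto base = cast(ubyte*) mmap(null, len * 2, PROT_NONE,
                                  MAP_PRIVATE | MAP_ANON, -1, 0);
    mmap(base, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
    mmap(base + len, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);
    return base;
}

void main()
{
    enum pageLen = 4096;
    auto buf = ringMap(pageLen);
    buf[0] = 42;
    assert(buf[pageLen] == 42); // same physical byte via the second mapping
}
```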
> The question is when to apply this dropping of old front elements. When
> I need to add one more element to the array, I can detect whether this
> *may* allocate by the expression 'arr.length == arr.capacity' but not
> really though, because the runtime may give me adjacent room without
> allocation. So I can delay the "drop the front elements" computation
> because there will be no actual copying at this time.
Even if you could do this, this doesn't help because at the end of the
pool, you need to reallocate into another pool (or back to the
beginning of the pool), because there can be no free pages after the
last page in the pool (you can't merge pools together).
> > https://dlang.org/phobos/core_memory.html#.GC.extend
> Ok, that sounds very useful. In addition to "I can detect when it *may*
> allocate", I can detect whether there is adjacent free room. (I can ask
> for just 1 element extension; I tested; and it works.) (I guess this
> GC.extend has to grab a GC lock.)
> However, for that to work, I seem to need the initial block pointer that
> the GC knows about. (My sliding array's .ptr does not work, so I have
> to save the initial arr.ptr.)
> the initial arr.ptr).
You can get this via `GC.query`, but that means 2 calls into the GC.
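For illustration, a hedged sketch combining the two calls: `GC.query` recovers the block base from a sliding slice's interior pointer, then `GC.extend` asks for adjacent free room. The return values depend on the GC implementation, so nothing specific is asserted about them here.

```d
// Hedged sketch: find the block base with GC.query, then try GC.extend.
import core.memory : GC;

void main()
{
    int[] arr = new int[](100);
    auto window = arr[10 .. $];       // sliding view; .ptr is interior

    auto info = GC.query(window.ptr); // call 1: find the containing block
    assert(info.base !is null);       // window.ptr is inside a GC block

    // Call 2: ask to extend the block by at least one element's bytes.
    // Returns the new total block size in bytes, or 0 if there is no
    // adjacent free room.
    size_t newSize = GC.extend(info.base, int.sizeof, int.sizeof * 16);
    if (newSize > 0)
    {
        // The block grew in place; the next append need not copy.
    }
}
```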
> 1) Although understanding the inner workings of the runtime is very
> useful and core.memory has interesting functions, it feels too much
> fragile work to get exactly what I want. I should manage my own memory
> (likely still backed by the GC).
> 2) I argue that the initial allocation should be more than 1 element so
> that we skip 1 allocation for most arrays and 50% for my window-of-1
> sliding window case.
So 2 things here:
1. I highly recommend trying out the ring buffer solution to see if it
helps. The disadvantage here is that you need to tie up at least a page
of memory.
2. All my tests using the ring buffer showed little to no performance
improvement over just copying back to the front of the buffer. So
consider just copying the data back to the front of an already allocated
buffer.
IIRC, your data does not need to be sequential in *physical memory*,
which means you can use a ring buffer that is segmented instead of
virtually mapped, and that can be of any size.
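The copy-back alternative can be sketched as follows (a hedged illustration; `compactToFront` is my name, and this version assumes plain-data element types, since `memmove` bypasses copy hooks):

```d
// Hedged sketch: keep one allocated buffer and move the live window
// [start .. end) back to the front when it reaches the end, instead of
// using a virtually mapped ring.
void compactToFront(T)(T[] buffer, ref size_t start, ref size_t end)
{
    import core.stdc.string : memmove;
    // Shift the live elements down to index 0 (ranges may overlap,
    // so memmove rather than memcpy).
    memmove(buffer.ptr, buffer.ptr + start, (end - start) * T.sizeof);
    end -= start;
    start = 0;
}

void main()
{
    auto buf = new int[](8);
    buf[5] = 42; buf[6] = 43;
    size_t start = 5, end = 7;
    compactToFront(buf, start, end);
    assert(buf[0] == 42 && buf[1] == 43);
    assert(start == 0 && end == 2);
}
```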
More information about the Digitalmars-d-learn mailing list