In-place extension of arrays only for certain alignment?

Wed Aug 17 18:40:06 UTC 2022

On 8/16/22 19:33, Steven Schveighoffer wrote:

 > Everything in the memory allocator is in terms of pages. A pool is a
 > section of pages. The large blocks are a *multiple* of pages, whereas
 > the small blocks are pages that are *divided* into same-sized chunks.

Thank you. I am appreciating this discussion a lot.

 > When you want to allocate a block, if it's a half-page or less, then it
 > goes into the small pool, where you can't glue 2 blocks together.

That's how it should be because we wouldn't want to work to know which 
blocks are glued.

 >>  > The reason why your `bad` version fails is because when it must
 >>  > reallocate, it still is only allocating 1 element.

I will get back to this 1 element allocation below.

 >> That part I still don't understand. The same block of e.g. 16 bytes
 >> still has room. Why not use that remaining portion?
 >
 > It does until it doesn't. Then it needs to reallocate.

I think my problem was caused by my test data being 16 bytes and (I 
think) the runtime picked memory from the 16-byte pool without any hope 
of future growth into adjacent block.

Using a 16-byte block sounds like a good strategy at first because 
nobody knows whether an array will get more than one element.

However, if my guess is correct (i.e. the first element of size of 
16-bytes is placed on a 16-byte block), then the next allocation will 
always allocate memory for the second element.

One might argue that dynamic arrays are likely to have more than a 
single element, so the initial block should at least be twice the 
element size. This would cut memory allocation by 1 count for all 
arrays. And in my case of 1-element arrays, allocation count would be 
halved. (Because I pay for every single append right now.)

Of course, I can understand that there can be applications where a large 
number of arrays (e.g. a 2D array) may have zero-elements or 
one-element, in which case my proposal of allocating the first element 
on a e.g. 32-byte block would be wasteful.

I think such cases are rare and we incur 1 extra allocation penalty for 
all arrays for that strategy.

 > Maybe some ASCII art? `A` is "used by the current slice", `x` is
 > "allocated, but not referenced by the array". `.` is "part of the block
 > but not used (i.e. can grow into this)". `M` is "metadata"
 >
 > ```d
 > auto arr = new ubyte[10];      // AAAA AAAA AA.. ...M
 > arr = arr[1 .. $];             // xAAA AAAA AA.. ...M
 > arr ~= 0;                      // xAAA AAAA AAA. ...M
 > arr ~= 0;                      // xAAA AAAA AAAA ...M
 > arr = arr[3 .. $];             // xxxx AAAA AAAA ...M
 > arr ~= cast(ubyte[])[1, 2, 3]; // xxxx AAAA AAAA AAAM // full!
 > arr ~= 1;                      // AAAA AAAA AAAA ...M // reallocated!
 > arr.ptr has changed
 > ```

That all makes sense. I didn't think the meta data would be at the end 
but I sense it's related to the "end slice", so it's a better place 
there. (?)

 > Metadata isn't always stored. Plus, it's optimized for the block size.
 > For example, any block that is 256 bytes or less only needs a single
 > byte to store the "used" space.

That's pretty interesting and smart.

 > What is your focus? Why do you really want this "optimization" of gluing
 > together items to happen?

This is about what you and I talked about in the past and something I 
mentioned in my DConf lightning talk this year. I am imagining a 
FIFO-like cache where elements of a source range are stored. There is a 
sliding window involved. I want to drop the unused front elements 
because they are expensive in 2 ways:

1) If the elements are not needed anymore, I have to move my slice 
forward so that the GC collects their pages.

2) If they are not needed anymore, I don't want to even copy them to a 
new block because this would be expensive, and in the case of an 
infinite source range, impossible.

The question is when to apply this dropping of old front elements. When 
I need to add one more element to the array, I can detect whether this 
*may* allocate by the expression 'arr.length == arr.capacity' but not 
really though, because the runtime may give me adjacent room without 
allocation. So I can delay the "drop the front elements" computation 
because there will be no actual copying at this time.

But the bigger issue is, because I drop elements my array never gets 
large enough to take advantage of this optimization and there is an 
allocation for every single append.

 > https://dlang.org/phobos/core_memory.html#.GC.extend

Ok, that sounds very useful. In addition to "I can detect when it *may* 
allocate", I can detect whether there is adjacent free room. (I can ask 
for just 1 element extension; I tested; and it works.) (I guess this 
GC.extend has to grab a GC lock.)

However, for that to work, I seem to need the initial block pointer that 
the GC knows about. (My sliding array's .ptr not work, so I have to save 
the initial arr.ptr).

Conclusion:

1) Although understanding the inner workings of the runtime is very 
useful and core.memory has interesting functions, it feels too much 
fragile work to get exactly what I want. I should manage my own memory 
(likely still backed by the GC).

2) I argue that the initial allocation should be more than 1 element so 
that we skip 1 allocation for most arrays and 50% for my window-of-1 
sliding window case.

Ali