Measuring the effect of heapSizeFactor on CPU and memory usage
FeepingCreature
feepingcreature at gmail.com
Wed Nov 15 06:52:28 UTC 2023
Write-up over on
https://gist.github.com/FeepingCreature/a47a3daed89d905668da08effaa4d6cd . I'll duplicate the content here as well, but I'm not sure if Github will be happy with hosting the images externally. If the graphs don't load, go to the Gist instead.
# The D GC
The D GC is tuned by default to trade memory for performance.
This can be clearly seen in the default heap size target of 2.0,
i.e. the GC will prefer to just allocate more memory until less
than half of the heap is occupied by live data.
But with long-running user-triggered processes, memory can be
more at a premium than CPU is, and larger heaps also mean slower
collection runs.
Can we tweak GC parameters to make D programs use less memory?
More importantly, what is the effect of doing so?
# Adjustable parameters
There are two important parameters: `heapSizeFactor` and
`maxPoolSize`.
- `heapSizeFactor` defines the target "used heap to live memory"
ratio. It defaults to `2.0`.
- `maxPoolSize` defines the maximum pool size, which is the unit
by which D allocates (and releases) memory from the operating
system.
So resident memory usage will generally grow in units of
`maxPoolSize`.
You can manually vary these parameters by passing
`--DRT-gcopt="heapSizeFactor:1.1 maxPoolSize:8"` to any D program.
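If you'd rather not pass that flag on every run, druntime also lets you compile defaults into the binary via `rt_options` (a minimal sketch; the values are just illustrative, and `--DRT-gcopt` on the command line should still override them):

```d
// Minimal sketch: embedding default GC options in the program itself via
// druntime's rt_options. These are only defaults; --DRT-gcopt on the
// command line still takes precedence.
extern(C) __gshared string[] rt_options = [
    "gcopt=heapSizeFactor:1.1 maxPoolSize:8"
];

void main()
{
    // ... program as usual; the GC reads rt_options at startup.
}
```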
As a reference program, I'll use my heap fragmentation/GC leak
testcase from
"[Why does this simple program leak 500MB of
RAM?](https://forum.dlang.org/thread/llusnybxbhglcscixmbp@forum.dlang.org)".
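For the curious, here's a rough sketch of one way to sample a process's resident memory from inside D on Linux by parsing `/proc/self/status` (just an illustration, not the harness the graphs below were made with):

```d
import std.stdio;
import std.algorithm : startsWith;
import std.array : split;
import std.conv : to;

// Parse VmRSS out of /proc/self/status; the line looks like "VmRSS:  123456 kB".
size_t currentRssKiB()
{
    foreach (line; File("/proc/self/status").byLine)
        if (line.startsWith("VmRSS:"))
            return line.split[1].to!size_t;
    return 0;
}

void main()
{
    writeln("RSS: ", currentRssKiB(), " KiB");
}
```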
# Observations

So here's a diagram of RSS memory usage and program runtime as I
adjust `heapSizeFactor` (on the X axis).
We can clearly see three things:
- the D GC is extremely random in actual heap usage (as expected
for a system without per-thread pools)
but becomes less so as collections get more frequent
- you can get a significant improvement in memory usage for very
little cost
- something wild happens between `heapSizeFactor=1.0` and
`heapSizeFactor=1.1`.
Clearly, using a linear scale was a mistake. Let's try a
different progression defined by `1 + 1 / (1.1 ^ x)`:
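These values approach 1.0 exponentially. Here's a quick sketch of how such a sweep of factors could be generated (the range of `x` is purely illustrative):

```d
import std.stdio;

void main()
{
    // heapSizeFactor values 1 + 1/1.1^x approach 1.0 exponentially as x grows.
    foreach (x; 0 .. 60)
        writefln("%.4f", 1.0 + 1.0 / (1.1 ^^ x));
}
```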

I've added four runs with different `maxPoolSize` settings.
Several additional things become clear:
- the exponential scale was the right way to go
- GC CPU usage goes up slower than memory goes down, indicating
significant potential benefit.
Interestingly, adjustments between 2 and 1.1 seem to have very
little effect.
Pretty much the only thing that matters is the number of zeroes
after the decimal point, and maybe the final digit.
For instance, if you're willing to accept a doubling of GC cost
for a halving of RAM,
you should tune your `heapSizeFactor` to 1.002.
Annoyingly, there seems to be no benefit from `maxPoolSize`. The
reduction in memory that you attain with smaller pools
is pretty much exactly offset by the increased CPU use, so
you could get the same reduction by just running the GC
more often via `heapSizeFactor`. Still, good to know.
Note that this benchmark was performed with an extremely GC-hungry
program. Performance impact and benefit may vary with the type
of process. Nonetheless, I'll be attempting, and advocating, to run
all but the most CPU-hungry of our services with
`--DRT-gcopt=heapSizeFactor:1.002`.
# Speculation
Why do more aggressive GC runs reduce total memory used? I can't
help but think it's down to heap fragmentation. D's GC is
non-moving, meaning once an object is allocated, it has to stay
at the same address until it is freed. As a result, for programs that mix
long-lived and short-lived allocations, such as "anything that
parses with std.json" and "anything that uses threads at all", a
pool that was only needed at peak memory usage may be kept alive
by a small number of surviving allocations. In that case, more
frequent GC runs will allow the program to pack more
actually-alive content into the pools already allocated, reducing
the peak use and thus fragmentation. In the long run it averages
out, but in the long run I restart the service because it uses too
much memory anyway.
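To illustrate the kind of allocation pattern I mean, here's a made-up sketch (not the actual testcase) that mixes short-lived bulk allocations with a few long-lived survivors:

```d
import std.stdio;

void main()
{
    ubyte[][] survivors;
    foreach (i; 0 .. 10_000)
    {
        // Short-lived allocation: garbage almost immediately, but it
        // inflates the heap while the program is at peak usage.
        auto scratch = new ubyte[](64 * 1024);
        scratch[0] = cast(ubyte) i;

        // Occasionally keep a small allocation alive. Since the GC is
        // non-moving, a handful of survivors can pin a pool that is
        // otherwise mostly free.
        if (i % 1_000 == 0)
            survivors ~= new ubyte[](256);
    }
    writeln("long-lived allocations kept: ", survivors.length);
}
```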
At any rate, without fundamental changes to the language, such as
finding ways to make at least some allocations movable, there
isn't anything to be done. For now, the default setting for
`heapSizeFactor` of 2 may be good for benchmarks, but for
long-running server processes, I suspect it makes the GC look
worse than it is.