Measuring the effect of heapSizeFactor on CPU and memory usage
FeepingCreature
feepingcreature at gmail.com
Wed Nov 15 06:52:28 UTC 2023
Write-up over on
https://gist.github.com/FeepingCreature/a47a3daed89d905668da08effaa4d6cd . I'll duplicate the content here as well, but I'm not sure if Github will be happy with hosting the images externally. If the graphs don't load, go to the Gist instead.
# The D GC
The D GC is tuned by default to trade memory for performance.
This can be clearly seen in the default heap size target of 2.0,
i.e. the GC will prefer to just allocate more memory until less
than half of the heap is occupied by live data.
But with long-running user-triggered processes, memory can be
more at a premium than CPU is, and larger heaps also mean slower
collection runs.
Can we tweak GC parameters to make D programs use less memory?
More importantly, what is the effect of doing so?
# Adjustable parameters
There are two important parameters: `heapSizeFactor` and
`maxPoolSize`.
- `heapSizeFactor` defines the target "used heap to live memory"
ratio. It defaults to `2.0`.
- `maxPoolSize` defines the maximum pool size, which is the unit
by which D allocates (and releases) memory from the operating
system.
So resident memory usage will generally grow in units of
`maxPoolSize`.
You can manually vary these parameters by passing
`--DRT-gcopt="heapSizeFactor:1.1 maxPoolSize:8"` to any D program.
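If you'd rather not pass that flag on every run, druntime also lets you compile defaults into the binary via `rt_options` (a minimal sketch; the values are just illustrative, and `--DRT-gcopt` on the command line should still override them):

```d
// Minimal sketch: embedding default GC options in the program itself via
// druntime's rt_options. These are only defaults; --DRT-gcopt on the
// command line still takes precedence.
extern(C) __gshared string[] rt_options = [
    "gcopt=heapSizeFactor:1.1 maxPoolSize:8"
];

void main()
{
    // ... program as usual; the GC reads rt_options at startup.
}
```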
As a reference program, I'll use my heap fragmentation/GC leak
testcase from
"[Why does this simple program leak 500MB of
RAM?](https://forum.dlang.org/thread/llusnybxbhglcscixmbp@forum.dlang.org)".
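For the curious, here's a rough sketch of one way to sample a process's resident memory from inside D on Linux by parsing `/proc/self/status` (just an illustration, not the harness the graphs below were made with):

```d
import std.stdio;
import std.algorithm : startsWith;
import std.array : split;
import std.conv : to;

// Parse VmRSS out of /proc/self/status; the line looks like "VmRSS:  123456 kB".
size_t currentRssKiB()
{
    foreach (line; File("/proc/self/status").byLine)
        if (line.startsWith("VmRSS:"))
            return line.split[1].to!size_t;
    return 0;
}

void main()
{
    writeln("RSS: ", currentRssKiB(), " KiB");
}
```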
# Observations

So here's a diagram of RSS memory usage and program runtime as I
adjust `heapSizeFactor` (on the X axis).
We can clearly see three things:
- the D GC is extremely random in actual heap usage (as expected
for a system without per-thread pools)
but becomes less so as collections get more frequent
- you can get a significant improvement in memory usage for very
little cost
- something wild happens between `heapSizeFactor=1.0` and
`heapSizeFactor=1.1`.
Clearly, using a linear scale was a mistake. Let's try a
different progression defined by `1 + 1 / (1.1 ^ x)`:
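These values approach 1.0 exponentially. Here's a quick sketch of how such a sweep of factors could be generated (the range of `x` is purely illustrative):

```d
import std.stdio;

void main()
{
    // heapSizeFactor values 1 + 1/1.1^x approach 1.0 exponentially as x grows.
    foreach (x; 0 .. 60)
        writefln("%.4f", 1.0 + 1.0 / (1.1 ^^ x));
}
```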

I've added four runs with different `maxPoolSize` settings.
Several additional things become clear:
- the exponential scale was the right way to go
- GC CPU usage goes up slower than memory goes down, indicating
significant potential benefit.
Interestingly, adjustments between 2 and 1.1 seem to have very
little effect.
Pretty much the only thing that matters is the number of zeroes
after the decimal point, and maybe the final digit.
For instance, if you're willing to accept a doubling of GC cost
for a halving of RAM,
you should tune your `heapSizeFactor` to 1.002.
Annoyingly, there seems to be no benefit from `maxPoolSize`. The
reduction in memory that you attain with smaller pools
is pretty much exactly offset by the increased CPU use, so
you could get the same reduction by just running the GC
more often via `heapSizeFactor`. Still, good to know.
Note that this benchmark was performed with an extremely GC-hungry
program. Performance impact and benefit may vary with the type
of process. Nonetheless, I'll be attempting, and advocating, to run
all but the most CPU-hungry of our services with
`--DRT-gcopt=heapSizeFactor:1.002`.
# Speculation
Why do more aggressive GC runs reduce total memory used? I can't
help but think it's down to heap fragmentation. D's GC is
non-moving, meaning once an object is allocated, it has to stay
at the same address until it is freed. As a result, for programs that mix
long-lived and short-lived allocations, such as "anything that
parses with std.json" and "anything that uses threads at all", a
pool that was only needed at peak memory usage may be kept alive
by a small number of surviving allocations. In that case, more
frequent GC runs will allow the program to pack more
actually-alive content into the pools already allocated, reducing
the peak use and thus fragmentation. In the long run it averages
out, but in the long run I restart the service because it uses too
much memory anyway.
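To illustrate the kind of allocation pattern I mean, here's a made-up sketch (not the actual testcase) that mixes short-lived bulk allocations with a few long-lived survivors:

```d
import std.stdio;

void main()
{
    ubyte[][] survivors;
    foreach (i; 0 .. 10_000)
    {
        // Short-lived allocation: garbage almost immediately, but it
        // inflates the heap while the program is at peak usage.
        auto scratch = new ubyte[](64 * 1024);
        scratch[0] = cast(ubyte) i;

        // Occasionally keep a small allocation alive. Since the GC is
        // non-moving, a handful of survivors can pin a pool that is
        // otherwise mostly free.
        if (i % 1_000 == 0)
            survivors ~= new ubyte[](256);
    }
    writeln("long-lived allocations kept: ", survivors.length);
}
```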
At any rate, without fundamental changes to the language, such as
finding ways to make at least some allocations movable, there
isn't anything to be done. For now, the default setting for
`heapSizeFactor` of 2 may be good for benchmarks, but for
long-running server processes, I suspect it makes the GC look
worse than it is.