What's the go with the GC these days?
Reimer Behrends
behrends at gmail.com
Sun Jan 6 20:28:47 UTC 2019
On Saturday, 5 January 2019 at 22:05:19 UTC, Manu wrote:
> How much truth is in here?
> What is this business about write barriers making the GC fast? Is that
> actually fact, or are there other hurdles?
First of all, stop-the-world GC gets an unnecessarily bad rap these days. A high-throughput stop-the-world GC can actually be pretty competitive, and latency may be lower than what you'd expect. You probably aren't going to run AAA video games with one, but there are many application domains where latency just doesn't matter, or doesn't matter as much; even normal interactive use (such as your typical GUI application) often works perfectly fine. Stop-the-world collectors are more of a problem for a functional programming language (or languages that exercise allocation-heavy functional idioms a lot), but less so for imperative languages, which have more control over allocation frequency.
In general, a modern precise stop-the-world collector should be able to handle 1 GB worth of pointers in something like 0.2 seconds on modern hardware, give or take (especially if you can leverage parallelism). Given that in an imperative language much of the heap will generally be occupied by non-pointer data, such as strings, bitmaps, or numerical data, this is plenty for typical interactive applications (you will often have interaction hiccups of comparable size from other system sources), and reasonably written server-side software can use a multi-process approach to limit the per-process heap size. For batch programs and most command-line tools, latency matters little anyway.
Throughput-wise, you should be able to perform on par with something like jemalloc, possibly faster, especially if you have compiler support (e.g. to inline pool allocations and avoid unnecessary zeroing of newly allocated memory when you statically know the object's size and layout). You generally won't be able to beat specialized allocators for throughput, though, only general-purpose ones.
Note that manual memory allocation is not a free lunch, either. As an extreme case, normal reference counting is like having a very expensive write barrier that even applies to stack frames. Manual allocation only really wins out for throughput if you can use optimized allocation techniques (such as a pool/arena allocator where you throw the pool/arena away at the end).
The problem with the D garbage collector is that it doesn't do well in terms of either throughput or latency. Even so, GC work is proportional to allocation work, so the impact on throughput in an imperative language like D is often less than you think, even if the GC doesn't have good performance. For example, I've used Tiny GC [1] in real-world C/C++ applications where I needed to avoid external dependencies, and the code still had good performance. That despite the fact that due to its design (a conservative GC based on system malloc()), its throughput isn't particularly great.
Fun fact: Erlang uses a stop-the-world collector (or used to, I haven't checked in a while), but as it is a distributed system with tons of processes with small heaps, it can also keep latency low, as each individual collection only has to traverse a limited amount of memory.
In order to get incremental GC with low pause times, you will need one of several approaches:
- Write barriers.
- Read barriers.
- Snapshot techniques.
- Optimized reference counting.
In practice, read barriers and snapshots without hardware support are often too inefficient for typical programming languages. Read barriers are still used (with heavy compiler support) in Azul's GC and Shenandoah, because (inter alia) they let an extremely low-latency GC get around the problem of scanning stacks on a multi-threaded system.
Snapshot-based systems are rare, but there was at one point a fork()-based snapshot collector for D1. The twofold problem with that is that fork() is POSIX-specific and that fork() does not get along well with threads. (It's possible, but a pain in the neck.)
Write barriers are usually the technique of choice, because they only incur overhead when pointers are written to the heap, so they have fairly low and predictable overhead.
Write barriers can also improve throughput. If you use generational collection *and* the generational hypothesis holds (i.e. you have a fair amount of short-lived objects), then the cost of the write barrier is usually more than offset by the greatly reduced tracing work necessary. You will need a write barrier for generational collection anyway, so you get the incremental part (mostly) for free. Write barriers are also particularly attractive in functional (including mostly functional) languages, where pointer writes to the heap are uncommon to begin with.
That said, in an imperative language you can often avoid having short-lived heap-allocated objects, so the generational hypothesis may not always apply. The most common use cases where generational GC might benefit D are probably if you are doing lots of work with strings or are incrementally growing a dynamic array starting from a small size.
The overhead of a write barrier is context-dependent. Probably the worst case is something like sorting an array in place, where each pointer swap will require two write barriers. That said, these situations can be worked around, especially in an optimizing compiler that can elide multiple write barriers for the same object, and more typical code will not write pointers to the heap that much to begin with.
Basic reference counting is normally dog slow due to memory locality issues. However, you can optimize the most performance-degrading case (assignments to local variables) in a couple of ways. You can optimize some of it away, which is what Swift does [2]. Or you can use deferred reference counting, which counts only references from global variables and the heap, but requires occasional stack scanning. Reference counting is less attractive for the multi-threaded case with shared heaps, because atomic reference counting is even slower; competitive implementations of deferred reference counting for Java therefore buffer increments and decrements in thread-local storage and process them in batches [3]. Also, additional work has to be done to deal with cycles.
And, of course, there are various hybrid approaches, such as generational collection for the minor heap and reference counting for the major heap.
Going back to D, the obvious low-hanging fruit is to bring its throughput up to par with modern stop-the-world collectors. If your latency requirements are stricter than what that can give you, chances are that you are already operating in a highly specialized domain and have very specific requirements, where you'd also have to invest non-trivial effort into tuning a general-purpose GC. That said, having (optional) support for write barriers in the compiler would be pretty nice, though I don't know how much effort that would be.
[1] https://github.com/ivmai/tinygc
[2] https://github.com/apple/swift/blob/master/docs/ARCOptimization.rst
[3] See David Bacon's seminal paper at https://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon01Concurrent.pdf
More information about the Digitalmars-d mailing list