What's the go with the GC these days?

Reimer Behrends behrends at gmail.com
Sun Jan 6 20:28:47 UTC 2019


On Saturday, 5 January 2019 at 22:05:19 UTC, Manu wrote:
> How much truth is in here?
> What is this business about write barriers making the GC fast? Is that
> actually fact, or are there other hurdles?

First of all, stop-the-world GC gets an unnecessarily bad rap these days. A high-throughput stop-the-world GC can actually be pretty competitive in terms of throughput, and latency may be lower than you'd expect. You probably aren't going to run AAA video games with one, but there are many application domains where latency just doesn't matter, or doesn't matter as much; even normal interactive use (such as your typical GUI application) often works plenty fine. Stop-the-world collectors are more of a problem for a functional programming language (or languages that exercise allocation-heavy functional idioms a lot), but less so for imperative languages that have more control over allocation frequency.

In general, a modern precise stop-the-world collector should be able to handle 1 GB worth of pointers in something like 0.2 seconds on modern hardware, give or take (especially if you can leverage parallelism). Given that much of the heap in an imperative language will generally be occupied by non-pointer data, such as strings, bitmaps, or numerical data, this is plenty for typical interactive applications (you will often have interaction hiccups of comparable size from other system sources), and reasonably written server-side software can use a multi-process approach to limit the per-process heap size. For batch programs and most command-line tools, latency matters little anyway. Throughput-wise, you should be able to perform on par with something like jemalloc, possibly faster, especially if you have compiler support (to e.g. inline pool allocations and avoid unnecessary zeroing of newly allocated memory if you statically know the object's size and layout). You generally won't be able to beat specialized allocators for throughput, though, only general-purpose allocators.
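
To make the compiler-support point concrete, here is a minimal sketch (purely illustrative; none of these names exist in druntime) of the kind of allocation fast path a compiler could inline when it statically knows the requested size: a thread-local bump allocation that skips zeroing because the caller initializes every field immediately.

    // Hypothetical thread-local allocation pool; not druntime's actual design.
    struct AllocPool
    {
        ubyte* next;   // next free byte in the current block
        ubyte* limit;  // end of the current block
    }

    void* allocFast(ref AllocPool pool, size_t size, void* delegate() slowPath)
    {
        auto p = pool.next;
        if (size <= cast(size_t)(pool.limit - p))
        {
            // Common case: bump the pointer. No zeroing is needed when the
            // compiler can prove the caller writes every field right away.
            pool.next = p + size;
            return p;
        }
        // Rare case: refill the pool or trigger a collection.
        return slowPath();
    }

With the size known at compile time, the bounds check and bump reduce to a handful of instructions at each call site.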

Note that manual memory allocation is not a free lunch, either. As an extreme case, normal reference counting is like having a very expensive write barrier that even applies to stack frames. Manual allocation only really wins out for throughput if you can use optimized allocation techniques (such as a pool/arena allocator where you throw the pool/arena away at the end).
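
For illustration, a region/arena allocator of the kind just mentioned can be as simple as the following sketch (an assumption-laden toy, not a proposal for Phobos): allocation is a pointer bump into a preallocated block, and everything is released at once by resetting the block.

    // Toy arena: carve allocations out of one block, free them all together.
    struct Arena
    {
        ubyte[] block;   // preallocated backing memory
        size_t used;     // bytes handed out so far

        void* alloc(size_t size, size_t alignment = 16)
        {
            auto offset = (used + alignment - 1) & ~(alignment - 1);
            if (offset + size > block.length)
                return null;          // a real arena would chain another block
            used = offset + size;
            return block.ptr + offset;
        }

        void reset() { used = 0; }    // "free" every allocation in O(1)
    }

Typical usage is one arena per frame, per request, or per compilation pass, with a reset() at the end instead of individual frees.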

The problem with the D garbage collector is that it doesn't do well in terms of either throughput or latency. Even so, GC work is proportional to allocation work, so the impact on throughput in an imperative language like D is often less than you'd think, even if the GC doesn't have good performance. For example, I've used Tiny GC [1] in real-world C/C++ applications where I needed to avoid external dependencies, and the code still had good performance, despite the fact that, due to its design (a conservative GC based on system malloc()), its throughput isn't particularly great.

Fun fact: Erlang uses a stop-the-world collector (or used to, I haven't checked in a while), but as it is a distributed system with tons of processes with small heaps, it can also keep latency low, as each individual collection only has to traverse a limited amount of memory.

In order to get incremental GC with low pause times, you will need one of several approaches:

-   Write barriers.
-   Read barriers.
-   Snapshot techniques.
-   Optimized reference counting.

In practice, read barriers and snapshots without hardware support are often too inefficient for typical programming languages. Read barriers are still used (with heavy compiler support) in Azul's GC and Shenandoah, because (inter alia) they let you get around the remaining obstacle to extremely low-latency GC, namely scanning the stacks of a multi-threaded system.
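
As a rough illustration of what a read barrier costs, here is a sketch of a Brooks-style forwarding-pointer barrier of the kind Shenandoah originally used (the layout and names are assumptions for illustration, not any actual runtime's API): every object carries an indirection pointer that normally points to the object itself, points to the new copy while the collector is relocating objects, and is followed on every reference load.

    // Hypothetical object layout with a Brooks-style forwarding pointer.
    struct ObjectHeader
    {
        void* forward;   // points to this object, or to its relocated copy
        // ... object fields follow ...
    }

    // Every read of a reference pays one extra dependent load.
    T* readBarrier(T)(T* obj)
    {
        return cast(T*) (cast(ObjectHeader*) obj).forward;
    }

The cost is paid on every load rather than every store, which is why such barriers usually need aggressive compiler support to stay cheap.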

Snapshot-based systems are rare, but there was at one point a fork()-based snapshot collector for D1. The twofold problem with that is that fork() is POSIX-specific and that fork() does not get along well with threads. (It's possible, but a pain in the neck.)

Write barriers are usually the technique of choice, because they only incur overhead when pointers are written to the heap, so they have fairly low and predictable overhead.
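
As a concrete (and heavily simplified) example of what such a barrier might look like, the sketch below shows a card-marking write barrier; the names are assumptions, not anything that exists in druntime. Every pointer store into the heap also sets a byte in a card table, so an incremental or generational collector only needs to rescan the dirty cards rather than the whole heap.

    // Hypothetical card-marking write barrier (not druntime's implementation).
    enum cardShift = 9;              // 512-byte cards; the size is an assumption
    __gshared ubyte[] cardTable;     // one byte per card
    __gshared ubyte*  heapBase;      // start of the managed heap

    void writeBarrier(void** slot, void* newValue)
    {
        *slot = newValue;
        // Mark the card covering the written slot as dirty.
        auto card = (cast(ubyte*) slot - heapBase) >> cardShift;
        cardTable[card] = 1;
    }

    // The compiler would emit writeBarrier(&obj.field, p) for pointer
    // stores into heap objects instead of a plain obj.field = p;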

Write barriers can also improve throughput. If you use generational collection *and* the generational hypothesis holds (i.e. you have a fair amount of short-lived objects), then the cost of the write barrier is usually more than offset by the greatly reduced tracing work necessary. You will need a write barrier for generational collection anyway, so you get the incremental part for (mostly) free. Write barriers are also particularly attractive in functional (including mostly functional) languages, where pointer writes to the heap are uncommon to begin with. That said, in an imperative language you can often avoid having short-lived heap-allocated objects, so the generational hypothesis may not always apply. The most common cases where generational GC might benefit D are probably doing lots of work with strings or incrementally growing a dynamic array from a small size.

The overhead of a write barrier is context-dependent. Probably the worst case is something like sorting an array in place, where each pointer swap will require two write barriers. That said, these situations can be worked around, especially in an optimizing compiler that can elide multiple write barriers for the same object, and more typical code will not write pointers to the heap that much to begin with.
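
A sketch of that worst case and of the elision opportunity (hypothetical code, assuming a card-marking barrier like the one above):

    // Swapping two pointers inside the same heap-allocated array.
    void swapPtrs(void*[] a, size_t i, size_t j)
    {
        auto tmp = a[i];
        a[i] = a[j];   // naive codegen emits one write barrier here...
        a[j] = tmp;    // ...and another one here
        // With card marking, both stores usually dirty the same card, so an
        // optimizing compiler could emit a single mark for the pair, or hoist
        // the marking out of the sorting loop entirely.
    }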

Basic reference counting is normally dog slow due to memory locality issues. However, you can optimize the most performance-degrading case (assignments to local variables) in a couple of ways. You can optimize some of them away, which is what Swift does [2]. Or you can use deferred reference counting, which counts only references from global variables and the heap, but requires occasional stack scanning. Reference counting is less attractive for the multi-threaded case with shared heaps, because atomic reference counting is even slower; competitive implementations of deferred reference counting for Java therefore buffer increments and decrements in thread-local storage and process them in batches [3]. Also, additional work has to be done to deal with cycles.
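
To illustrate the buffering idea from [3], here is a minimal sketch (illustrative names and sizes, not a real implementation): each mutator thread logs increments and decrements into thread-local buffers, and a collector applies them in batches, so the hot path avoids atomic operations on shared reference counts.

    // Hypothetical per-thread buffer for deferred/buffered reference counting.
    struct RCBuffer
    {
        enum capacity = 4096;
        void*[capacity] incs;
        void*[capacity] decs;
        size_t nIncs, nDecs;

        void logInc(void* obj)
        {
            if (nIncs == capacity) flush();
            incs[nIncs++] = obj;
        }

        void logDec(void* obj)
        {
            if (nDecs == capacity) flush();
            decs[nDecs++] = obj;
        }

        void flush()
        {
            // Hand both buffers to the collector, which applies all increments
            // before decrements (so counts never transiently reach zero),
            // frees objects whose count really is zero, and handles cycles.
            nIncs = nDecs = 0;
        }
    }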

And, of course, there are various hybrid approaches, such as generational collection for the minor heap and reference counting for the major heap.

Going back to D, the obvious low-hanging fruit is to bring its throughput up to par with modern stop-the-world collectors. If your latency requirements are much stricter than what that can give you, chances are that you are already operating in a highly specialized domain with very specific requirements, where you'd also have to invest non-trivial effort into tuning a general-purpose GC.

That said, having (optional) support for write barriers in the compiler would be pretty nice, though I don't know how much effort that would be.

[1] https://github.com/ivmai/tinygc

[2] https://github.com/apple/swift/blob/master/docs/ARCOptimization.rst

[3] See David Bacon's seminal paper at
https://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon01Concurrent.pdf


