What's the go with the GC these days?
Reimer Behrends
behrends at gmail.com
Tue Jan 8 11:22:59 UTC 2019
On Tuesday, 8 January 2019 at 09:31:09 UTC, Márcio Martins wrote:
> Is the current implementation of simple MS not improvable in
> any way?
I'd say it is. I've written up a small benchmark here:
https://gist.github.com/rbehrends/3d4ab87d34e79722272189abc68eee00
Options:
* D with its native GC, compiled with ldc 1.13.0.
* jemalloc, version 5.1.0.
* The Boehm GC, version 8.0.2 with sequential marking.
* The Boehm GC with parallel marking.
* The Boehm GC using incremental GC.
* The Boehm GC with explicit deallocation instead of GC.
* System malloc on macOS 10.12.6.
This is a version of the "binary trees" benchmark [1], as it was
the only allocation-oriented benchmark readily available and
easily portable between languages. Because it is a very
specialized benchmark (it allocates a *lot* of small fixed-size
objects), I'll caution against generalizing its results, but it
can give us some insights all the same.
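For reference, here is a rough sketch of the allocation pattern the
benchmark exercises, written in C against the Boehm GC. This is
illustrative only, not the actual gist code; the node layout, the
function names, and the exact checksum formula are simplified.

#include <gc.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct node {
    struct node *left, *right;   /* a small, fixed-size, pointer-only object */
} node;

static node *make_tree(int depth) {
    /* GC_MALLOC returns zeroed memory, so left/right start out NULL;
     * there is no matching free, the collector reclaims dead trees. */
    node *n = GC_MALLOC(sizeof(node));
    if (depth > 0) {
        n->left  = make_tree(depth - 1);
        n->right = make_tree(depth - 1);
    }
    return n;
}

static long checksum(const node *n) {
    /* walk the tree without allocating anything */
    if (n == NULL) return 1;
    return 1 + checksum(n->left) + checksum(n->right);
}

int main(int argc, char **argv) {
    int depth = argc > 1 ? atoi(argv[1]) : 21;
    GC_INIT();
    printf("%ld\n", checksum(make_tree(depth)));
    return 0;
}

The D and jemalloc/system-malloc variants do the same work, with new
in D and with explicit malloc()/free() calls respectively.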
Here's a sample run of the various configurations on a
2.6 GHz Core i7 with four cores; the code that does no
allocation or collection (i.e. checksum()) accounts for less than
two seconds of runtime. Most actual tuned non-functional programs
spend only 10%-20% of their time on allocation and collection, so
this is not terribly representative of real applications and
exaggerates differences in performance.
$ make benchmark DEPTH=21
# D garbage collector
/usr/bin/time ./btree-d 21 >/dev/null
34.24 real 33.96 user 0.21 sys
# jemalloc explicit malloc()/free()
/usr/bin/time ./btree-jemalloc 21 >/dev/null
21.62 real 21.35 user 0.22 sys
# Boehm GC with four parallel marker threads
GC_MARKERS=4 /usr/bin/time ./btree-gc 21 >/dev/null
9.39 real 12.02 user 0.18 sys
# Boehm GC with single-threaded marking
GC_MARKERS=1 /usr/bin/time ./btree-gc 21 >/dev/null
11.42 real 11.34 user 0.06 sys
# Boehm GC with explicit deallocation
GC_MARKERS=1 /usr/bin/time ./btree-gc-free 21 >/dev/null
13.12 real 13.05 user 0.05 sys
# Boehm GC with incremental collection (single-threaded)
/usr/bin/time ./btree-gc-inc 21 >/dev/null
20.74 real 17.21 user 6.29 sys
# System malloc()/free()
/usr/bin/time ./btree-sysmalloc 21 >/dev/null
67.73 real 66.97 user 0.74 sys
The fastest configuration is the parallel-mark version of the
Boehm-Demers-Weiser GC, which beats all the competitors. The
rear is brought up by system malloc(); to nobody's surprise,
the standard malloc()/free() on macOS is dog slow.
The Boehm GC benchmarks are run with interior pointers disabled,
in order to prevent the extra padding on small objects that
would otherwise blow up the heap size. Even with interior pointers
enabled, the relative performance doesn't change all that much:
the single-threaded run clocks in at around 16 seconds instead,
still twice as fast as the D GC, and the parallel-marking version
finishes in around 13 seconds. It is worth noting that
the Boehm GC still goes through all the conservative marking
machinery and still checks interior pointers from the stack
regardless, so a precise or semi-precise collector may still be
able to gain additional performance.
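For context, disabling interior pointers is a one-line switch in the
Boehm GC; a minimal sketch (standard gc.h API, set before GC_INIT()):

#include <gc.h>

int main(void) {
    /* 1 (the usual default) makes any pointer into the middle of an
     * object keep it alive, which forces extra padding on small objects;
     * 0 means only pointers to the start of an object are considered. */
    GC_set_all_interior_pointers(0);
    GC_INIT();
    /* ... allocate with GC_MALLOC() as usual ... */
    return 0;
}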
Parallel marking gives us a noticeable speedup in wall clock time
and a really big one in GC pauses (not displayed here) at the
expense of using more CPU time.
Incremental collection is super expensive in comparison, because
the mprotect()-based write barrier is paid for by code that
technically would not require write barriers at all. Even so, it is
still on par with jemalloc for throughput.
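For reference, incremental mode has to be opted into explicitly;
presumably the btree-gc-inc build does something along these lines
(a sketch, not the gist's actual code):

#include <gc.h>

int main(void) {
    GC_INIT();
    /* switch to incremental/generational collection; on this platform
     * the write barrier is implemented by mprotect()-ing pages and
     * catching the resulting faults, which is where the extra system
     * time in the btree-gc-inc run above comes from */
    GC_enable_incremental();
    /* ... allocation as before; marking now proceeds in small steps ... */
    return 0;
}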
In terms of GC pause times, we get the following for the Boehm GC:
$ time GC_MARKERS=4 ./btree-gc 23 | tail -5
88 collections
0.056 seconds/collection
0.080 max seconds/collection
818 MB heap size
306 MB free bytes
This is again with parallel marking (four marker threads) and
a larger heap. The collection times are wall clock times,
based on gettimeofday(). Note that most heaps contain at
least some non-pointer data, whereas ours is almost 100% pointers.
With single-threaded marking, the maximum collection time more than
triples, to nearly 0.3 seconds.
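For the curious, here is roughly how such numbers can be obtained with
gettimeofday(); this is a sketch only, the benchmark's own
instrumentation may hook the collector differently:

#include <gc.h>
#include <stdio.h>
#include <sys/time.h>

static double now_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void) {
    GC_INIT();
    /* ... allocate ... */
    double start = now_seconds();
    GC_gcollect();   /* force one full collection and time it */
    printf("%.3f seconds/collection\n", now_seconds() - start);
    printf("%zu heap bytes, %zu free bytes\n",
           GC_get_heap_size(), GC_get_free_bytes());
    return 0;
}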
As an interesting aside, the maximum pause time for explicitly
deleting trees in the benchmark is much larger than the GC pause
time (nearly a second for jemalloc) and would remain larger
even if the tree-deletion operation were parallelized. This is a
good reminder that cascading deletes in RAII can also be
problematic for pause times and generally require tradeoffs
(deferred/concurrent deletion, sacrificing timely destructor
calls) in order to work in real-time systems.
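To make the deferred-deletion option concrete, here is a rough sketch
(hypothetical helper names, plain malloc()/free(), same node layout as
the earlier sketch): dead trees are parked on a pending list and
released a bounded number of nodes at a time, so no single step has to
pay for the whole cascade.

#include <stdlib.h>

typedef struct node {
    struct node *left, *right;
} node;

/* singly linked list of subtrees whose deletion has been deferred */
typedef struct pending {
    node *tree;
    struct pending *next;
} pending;

static pending *deferred = NULL;

/* O(1): just remember the root, nothing is freed yet */
static void defer_delete(node *root) {
    if (!root) return;
    pending *p = malloc(sizeof *p);
    p->tree = root;
    p->next = deferred;
    deferred = p;
}

/* Free at most `budget` nodes, then return; call this periodically
 * (e.g. once per frame or per N allocations) so that no single call
 * produces a long pause. */
static void release_some(size_t budget) {
    while (budget > 0 && deferred != NULL) {
        pending *p = deferred;
        node *n = p->tree;
        deferred = p->next;
        free(p);
        defer_delete(n->left);
        defer_delete(n->right);
        free(n);
        budget--;
    }
}

The price is exactly the one mentioned above: a dead tree's memory is
not returned promptly, and with destructors in the picture their
execution would be deferred as well.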
[1]
https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/binarytrees.html