What's the go with the GC these days?
Reimer Behrends
behrends at gmail.com
Tue Jan 8 11:22:59 UTC 2019
On Tuesday, 8 January 2019 at 09:31:09 UTC, Márcio Martins wrote:
> Is the current implementation of simple MS not improvable in
> any way?
I'd say it is. I've written up a small benchmark here:
https://gist.github.com/rbehrends/3d4ab87d34e79722272189abc68eee00
Options:
* D with its native GC, compiled with ldc 1.13.0.
* jemalloc, version 5.1.0.
* The Boehm GC, version 8.0.2 with sequential marking.
* The Boehm GC with parallel marking.
* The Boehm GC using incremental GC.
* The Boehm GC with explicit deallocation instead of GC.
* System malloc on macOS 10.12.6.
This is a version of the "binary trees" benchmark [1], as it was
the only allocation-oriented benchmark readily available and
easily portable between languages. Because it is a very
specialized benchmark (it allocates a *lot* of small fixed-size
objects), I'll caution against generalizing its results, but it
can give us some insights all the same.
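For reference, here is a rough sketch of the allocation pattern the
benchmark exercises, written in C against the Boehm GC. This is
illustrative only, not the actual gist code; the node layout, the
function names, and the exact checksum formula are simplified.

#include <gc.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct node {
    struct node *left, *right;   /* a small, fixed-size, pointer-only object */
} node;

static node *make_tree(int depth) {
    /* GC_MALLOC returns zeroed memory, so left/right start out NULL;
     * there is no matching free, the collector reclaims dead trees. */
    node *n = GC_MALLOC(sizeof(node));
    if (depth > 0) {
        n->left  = make_tree(depth - 1);
        n->right = make_tree(depth - 1);
    }
    return n;
}

static long checksum(const node *n) {
    /* walk the tree without allocating anything */
    if (n == NULL) return 1;
    return 1 + checksum(n->left) + checksum(n->right);
}

int main(int argc, char **argv) {
    int depth = argc > 1 ? atoi(argv[1]) : 21;
    GC_INIT();
    printf("%ld\n", checksum(make_tree(depth)));
    return 0;
}

The D and jemalloc/system-malloc variants do the same work, with new
in D and with explicit malloc()/free() calls respectively.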
Here's a sample run of the various configurations on a
2.6 GHz Core i7 with four cores; the code that does no
allocation or collection (i.e. checksum()) accounts for less than
two seconds of runtime. Most actual tuned non-functional programs
spend only 10%-20% of their time on allocation and collection, so
this is not terribly representative of real applications and
exaggerates differences in performance.
$ make benchmark DEPTH=21
# D garbage collector
/usr/bin/time ./btree-d 21 >/dev/null
34.24 real 33.96 user 0.21 sys
# jemalloc explicit malloc()/free()
/usr/bin/time ./btree-jemalloc 21 >/dev/null
21.62 real 21.35 user 0.22 sys
# Boehm GC with four parallel marker threads
GC_MARKERS=4 /usr/bin/time ./btree-gc 21 >/dev/null
9.39 real 12.02 user 0.18 sys
# Boehm GC with single-threaded marking
GC_MARKERS=1 /usr/bin/time ./btree-gc 21 >/dev/null
11.42 real 11.34 user 0.06 sys
# Boehm GC with explicit deallocation
GC_MARKERS=1 /usr/bin/time ./btree-gc-free 21 >/dev/null
13.12 real 13.05 user 0.05 sys
# Boehm GC with incremental collection (single-threaded)
/usr/bin/time ./btree-gc-inc 21 >/dev/null
20.74 real 17.21 user 6.29 sys
# System malloc()/free()
/usr/bin/time ./btree-sysmalloc 21 >/dev/null
67.73 real 66.97 user 0.74 sys
The fastest configuration is the parallel-mark version of the
Boehm-Demers-Weiser GC, which beats all the competitors. The
rear is brought up by system malloc(); to nobody's surprise,
the standard malloc()/free() on macOS is dog slow.
The Boehm GC benchmarks are run with interior pointers disabled,
in order to prevent the extra padding on small objects that
would otherwise blow up the heap size. Even with interior pointers
enabled, the relative performance doesn't change all that much:
the single-threaded run clocks in at around 16 seconds instead,
still twice as fast as the D GC, and the parallel-marking version
finishes in around 13 seconds. It is worth noting that
the Boehm GC still goes through all the conservative marking
machinery and still checks interior pointers from the stack
regardless, so a precise or semi-precise collector may still be
able to gain additional performance.
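For context, disabling interior pointers is a one-line switch in the
Boehm GC; a minimal sketch (standard gc.h API, set before GC_INIT()):

#include <gc.h>

int main(void) {
    /* 1 (the usual default) makes any pointer into the middle of an
     * object keep it alive, which forces extra padding on small objects;
     * 0 means only pointers to the start of an object are considered. */
    GC_set_all_interior_pointers(0);
    GC_INIT();
    /* ... allocate with GC_MALLOC() as usual ... */
    return 0;
}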
Parallel marking gives us a noticeable speedup in wall clock time
and a really big one in GC pauses (not displayed here) at the
expense of using more CPU time.
Incremental collection is super expensive in comparison, because
the mprotect()-based write barrier is paid for by code that
technically would not require write barriers at all. Even so, it is
still on par with jemalloc for throughput.
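For reference, incremental mode has to be opted into explicitly;
presumably the btree-gc-inc build does something along these lines
(a sketch, not the gist's actual code):

#include <gc.h>

int main(void) {
    GC_INIT();
    /* switch to incremental/generational collection; on this platform
     * the write barrier is implemented by mprotect()-ing pages and
     * catching the resulting faults, which is where the extra system
     * time in the btree-gc-inc run above comes from */
    GC_enable_incremental();
    /* ... allocation as before; marking now proceeds in small steps ... */
    return 0;
}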
In terms of GC pause times, we get the following for the Boehm GC:
$ time GC_MARKERS=4 ./btree-gc 23 | tail -5
88 collections
0.056 seconds/collection
0.080 max seconds/collection
818 MB heap size
306 MB free bytes
This is again with parallel marking (four marker threads) and
a larger heap. The collection times are wall clock times,
based on gettimeofday(). Note that most heaps contain at
least some non-pointer data, whereas ours is almost 100% pointers.
With single-threaded marking, the maximum collection time more than
triples, to nearly 0.3 seconds.
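For the curious, here is roughly how such numbers can be obtained with
gettimeofday(); this is a sketch only, the benchmark's own
instrumentation may hook the collector differently:

#include <gc.h>
#include <stdio.h>
#include <sys/time.h>

static double now_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void) {
    GC_INIT();
    /* ... allocate ... */
    double start = now_seconds();
    GC_gcollect();   /* force one full collection and time it */
    printf("%.3f seconds/collection\n", now_seconds() - start);
    printf("%zu heap bytes, %zu free bytes\n",
           GC_get_heap_size(), GC_get_free_bytes());
    return 0;
}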
As an interesting aside, the maximum pause time for explicitly
deleting trees in the benchmark is much larger than the GC pause
time (nearly a second for jemalloc) and would remain larger
even if the tree-deletion operation were parallelized. This is a
good reminder that cascading deletes in RAII can also be
problematic for pause times and generally require tradeoffs
(deferred/concurrent deletion, sacrificing timely destructor
calls) in order to work in real-time systems.
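To make the deferred-deletion option concrete, here is a rough sketch
(hypothetical helper names, plain malloc()/free(), same node layout as
the earlier sketch): dead trees are parked on a pending list and
released a bounded number of nodes at a time, so no single step has to
pay for the whole cascade.

#include <stdlib.h>

typedef struct node {
    struct node *left, *right;
} node;

/* singly linked list of subtrees whose deletion has been deferred */
typedef struct pending {
    node *tree;
    struct pending *next;
} pending;

static pending *deferred = NULL;

/* O(1): just remember the root, nothing is freed yet */
static void defer_delete(node *root) {
    if (!root) return;
    pending *p = malloc(sizeof *p);
    p->tree = root;
    p->next = deferred;
    deferred = p;
}

/* Free at most `budget` nodes, then return; call this periodically
 * (e.g. once per frame or per N allocations) so that no single call
 * produces a long pause. */
static void release_some(size_t budget) {
    while (budget > 0 && deferred != NULL) {
        pending *p = deferred;
        node *n = p->tree;
        deferred = p->next;
        free(p);
        defer_delete(n->left);
        defer_delete(n->right);
        free(n);
        budget--;
    }
}

The price is exactly the one mentioned above: a dead tree's memory is
not returned promptly, and with destructors in the picture their
execution would be deferred as well.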
[1]
https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/binarytrees.html