[dmd-concurrency] D's Memory Model
Robert Jacques
sandford at jhu.edu
Thu Feb 4 21:18:19 PST 2010
From a practical perspective, it is quite infuriating to correctly
implement a parallel algorithm, only to have it perform slower than its
single-threaded counterpart for no apparent reason. This commonly happens
when some form of hidden serialization is present, which may be
deterministic or non-deterministic and may arise at either the hardware or
software level. Although fixing these issues is more of an optimization
than a language design issue, some of the solutions require a slight
change to D's memory model and thus a change to the language.
False sharing
False sharing occurs when two separate memory locations on the same cache
line are modified by different threads. Automatic cache synchronization
then causes the entire line to be repeatedly flushed and re-loaded,
causing stalls and wasting bandwidth. This has led to the practice of
padding certain heap-allocated data structures (like message queues) by a
cache line before and by [cache line size - T.sizeof] after. However, this
is awkward to use, only fixes hard-to-find hot spots and is not portable;
cache line sizes differ from CPU to CPU. In D, this issue affects all
objects 32 bytes or smaller (on most CPUs). It occurs when a thread's
local objects are allocated next to shared objects, immutable objects or
a different thread's local objects. It also occurs when shared objects are
located next to each other or next to immutable objects. (Immutable
objects can be safely allocated next to each other because they aren't
written to.) Thread-local memory pools (see thread-local allocation below)
should solve the first of these issues by segregating local objects from
each other and from shared state. The second issue can be addressed by
having the GC always allocate memory in cache-line units for shared
objects. This would be portable (runtime cache-line determination) and
more memory efficient than manual object padding; only a single cache
line, instead of two, is used for small shared objects, like message
nodes. The downside of this approach is that it requires separate memory
pools for local and shared objects, which makes casting thread-local data
to shared or immutable a compiler error.
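For illustration only, here is roughly what the manual padding scheme
described above looks like in D, assuming a fixed 64-byte cache line (the
type and field names are just a sketch; a portable version would have to
query the line size at runtime instead):

enum cacheLineSize = 64;

// A small message-queue node; several of these would normally share one
// cache line when allocated back to back.
struct MessageNode
{
    MessageNode* next;
    int payload;
}

// Padding wrapper following the scheme above: one cache line in front
// and (cache line size - T.sizeof) behind.
struct Padded(T)
{
    static assert(T.sizeof <= cacheLineSize);
    ubyte[cacheLineSize] pad0;             // leading pad
    T value;                               // the actual data
    ubyte[cacheLineSize - T.sizeof] pad1;  // trailing pad
}

void main()
{
    auto node = new Padded!MessageNode;    // heap-allocated, padded node
    node.value.payload = 42;
}

Note how two cache lines are consumed per node, which is exactly the
memory overhead that GC-level cache-line allocation of shared objects
would avoid.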
The Garbage Collection Lock
Today in D, the single, global GC lock is a major parallel performance
bottleneck. The solution is to use thread-local allocation, which has been
well proven in other languages (C/C++/C#/Java/etc); the basic concept is
that each thread maintains its own free-list (or bump allocator, etc.)
from which it allocates and to which it frees. As objects tend to live and
die on the thread that allocates them, false sharing is greatly reduced
even without separate local and shared memory pools. In D, however,
thread-local allocation is challenging, as the collection algorithm has to
determine which threads to return garbage to. Getting this wrong leads to
both slow performance and memory leaks. The Boehm GC, for example, hands
out a page of objects to the local allocators when they run low. However,
if the pages are 'clean', memory is wasted by excessive heap
fragmentation, while if they are 'dirty', the amortized cost of the GC
lock is greater and false sharing occurs. To prevent false sharing, the
garbage from 'dirty' (i.e. fragmented) pages should be returned to the
allocating thread, at least for objects under the cache-line size. In
general, there is a trade-off between giving garbage back preemptively and
holding it globally: the former tends to lock infrequently, but always
stops at least one other thread when it does (to steal clean pages), while
the latter locks frequently but doesn't necessarily stop any other
threads.
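To sketch the basic idea (with hypothetical names and a single size class,
not the actual runtime API), a thread-local free-list allocator might look
roughly like this in D, where module-level variables are thread-local by
default:

struct FreeNode { FreeNode* next; }

enum blockSize = 32;      // one size class, for simplicity
FreeNode* localFreeList;  // module-level, hence thread-local in D

void* allocateBlock()
{
    if (localFreeList is null)
        refillFromGlobalHeap();  // the only path that needs the global lock
    auto block = localFreeList;
    localFreeList = block.next;
    return cast(void*) block;
}

void freeBlock(void* p)
{
    // Memory freed by the thread that allocated it goes straight back to
    // that thread's own list: no lock taken, no false sharing introduced.
    auto node = cast(FreeNode*) p;
    node.next = localFreeList;
    localFreeList = node;
}

void refillFromGlobalHeap()
{
    import core.stdc.stdlib : malloc;
    // Stand-in for "take the global lock and grab a page of clean blocks".
    foreach (i; 0 .. 16)
        freeBlock(malloc(blockSize));
}

The interesting policy questions are all hidden in refillFromGlobalHeap
and in what the collector does with garbage that belongs to another
thread's list, which is exactly the clean/dirty page trade-off described
above.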
One additional advantage of segregating the local and shared/immutable
memory pools is that it naturally lends itself to thread-local garbage
collection: a modern GC style which is pseudo-concurrent, parallel,
high-throughput and low-overhead, scales linearly, and reduces false
pointers and cache misses.