[dmd-concurrency] D's Memory Model

Robert Jacques sandford at jhu.edu
Thu Feb 4 21:18:19 PST 2010


 From a practical perspective, it is quite infuriating to correctly  
implement a parallel algorithm, only to have it perform slower than its  
single-threaded counterpart for no apparent reason. This commonly occurs  
when some form of hidden serialization takes place, which can be  
deterministic or non-deterministic and can arise at either the hardware  
or the software level. Although fixing these issues is more of an  
optimization than a language design issue, some of the solutions require  
a slight change to D's memory model and thus a change to the language.

False sharing
False sharing occurs when two separate memory locations on the same cache  
line are modified by different threads. Automatic cache synchronization  
then causes the entire line to be repeatedly flushed and re-loaded,  
causing stalls and wasting bandwidth. This has led to the practice of  
padding certain heap-allocated data structures (like message queues) by a  
cache line before and by [cache line size - T.sizeof] after. However, this  
is awkward to use, only fixes hard-to-find hot-spots, and is not portable;  
cache line sizes differ from CPU to CPU. In D, this issue affects all  
objects 32 bytes or smaller (on most CPUs). It occurs when one thread's  
local objects are allocated next to shared, immutable, or a different  
thread's local objects. It also occurs when shared objects are located  
next to each other or next to immutable objects. (Immutable objects can  
be safely allocated next to each other because they aren't written to.)  
Thread-local memory pools (see thread-local allocation below) should  
solve the first of these issues by segregating local objects from each  
other and from shared state. The second issue can be addressed by having  
the GC always allocate memory in cache-line units for shared objects. This  
would be portable (runtime cache-line determination) and more memory  
efficient than manual object padding: only a single cache line, instead  
of two, is used for small shared objects, like message nodes. The downside  
of this approach is that it requires separate memory pools for local and  
shared objects, which makes casting thread-local data to shared or  
immutable a compiler error.

The Garbage Collection Lock
Today in D, the single, global GC lock is a major parallel performance  
bottleneck. The solution is thread-local allocation, which has been  
well proven in other languages (C/C++/C#/Java/etc.); the basic concept is  
that each thread maintains its own free-list (or bump allocator, etc.)  
from which it allocates and to which it frees. And as objects tend to  
live and die on the thread that allocates them, even without separate  
local and shared memory pools, false sharing is greatly reduced. In D,  
however, thread-local allocation is challenging, as the collection  
algorithm has to determine which threads to return garbage to. Getting  
this wrong leads to both slow performance and memory leaks. The Boehm GC,  
for example, hands out a page of objects to the local allocators when  
they run low. However, if those pages are 'clean', memory is wasted  
through excessive heap fragmentation, while if they are 'dirty', the  
amortized cost of the GC lock is greater and false sharing occurs. To  
prevent false sharing, the garbage from 'dirty' (i.e. fragmented) pages  
should be returned to its allocating thread, at least for pages of  
objects under the cache-line size. In general, there is a trade-off  
between giving garbage back preemptively and holding it globally: the  
former tends to lock infrequently, but always stops at least one other  
thread when it does (to steal clean pages), while the latter locks  
frequently but doesn't necessarily stop any other threads.

One additional advantage of segregating the local and shared/immutable  
memory pools is that it naturally lends itself to thread-local garbage  
collection: a modern GC style that is pseudo-concurrent, parallel,  
high-throughput, and low-overhead, scales linearly, and reduces false  
pointers and cache misses.

