[dmd-concurrency] D's Memory Model

Robert Jacques sandford at jhu.edu
Mon Feb 8 10:15:42 PST 2010


On Mon, 08 Feb 2010 06:07:36 -0500, Fawzi Mohamed <fawzi at gmx.ch> wrote:
> I was away a couple of days, and coming back I found your interesting  
> post Robert Jacques.
>
> I had not thought much about false sharing, thanks for having raised the  
> issue.
>
> In general, both for false sharing and for the global GC lock (which is
> *really* a bottleneck for current heavily multithreaded code if you
> allocate in the threads), I fully agree that having thread-local (or
> CPU-local) allocators is the correct way out.
>
> Without shared you can have two approaches.

I'm assuming you are talking about the collection phase, not the allocation
phase. The empirical tests I've seen indicate that the allocation bottleneck
currently dwarfs the collection bottleneck.
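
A minimal sketch of the kind of test I mean (not from this thread; it uses
current druntime names, and the 4 threads / 32-byte / one-million figures are
arbitrary): every `new` below funnels through the single global GC lock, so
adding threads barely improves allocation throughput.

    import core.thread;

    void allocLoop()
    {
        foreach (i; 0 .. 1_000_000)
        {
            auto buf = new ubyte[](32);   // each allocation takes the global GC lock
        }
    }

    void main()
    {
        Thread[] threads;
        foreach (t; 0 .. 4)
        {
            auto th = new Thread(&allocLoop);
            th.start();
            threads ~= th;
        }
        foreach (th; threads)
            th.join();
    }

Comparing the wall time with 1 thread versus 4 is one way to see the
allocation bottleneck those tests point at.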

> One is  the simplistic "stop the world" (but making the mark & co  
> concurrent). The problem is the latency that is introduced by this  
> approach, but if done correctly (well parallelized) the ratio GC time/ 
> computation time should not be worse than other approaches.

Your statement is a bit confusing, since concurrent collectors are generally
not considered "stop the world". Concurrent collectors trade throughput (GC
time / program time) for program responsiveness. They also often trade program
correctness for latency, as they are notoriously hard to get correct for
languages with mutation. While there are some very elegant concurrent GCs for
functional languages, the high-level descriptions of the Boehm concurrent GC
seem to be inherently racy, while Java's concurrent GCs tend to be highly
instrumented and lock-tastic (mark-bit locking on every ref assignment). As D
is a multi-paradigm language that allows C function calls, most of the
existing concurrent collector algorithms don't seem to be applicable to D.
(Please feel free to correct me if you know of a paper on a concurrent GC that
would be good for D.)
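
To make "mark-bit locking on every ref assignment" concrete, here is a rough
sketch of what such a write barrier amounts to. `gcLock`, `Block` and
`findBlock` are hypothetical stand-ins for collector internals, not actual
druntime symbols:

    import core.sync.mutex : Mutex;

    __gshared Mutex gcLock;          // stand-in for the collector's global lock

    struct Block { bool marked; }    // stand-in for a GC heap-block header

    Block* findBlock(const(void)* p)
    {
        // stand-in for the O(log N) pool-tree lookup a real collector would do
        static Block dummy;
        return &dummy;
    }

    // Every pointer store the mutator performs would have to go through this.
    void writeBarrier(T)(ref T* slot, T* value)
    {
        gcLock.lock();
        scope (exit) gcLock.unlock();
        if (value !is null)
            findBlock(value).marked = true;   // keep the mark state consistent
        slot = value;
    }

    void main()
    {
        gcLock = new Mutex;
        int* p;
        auto q = new int;
        writeBarrier(p, q);          // instead of a plain `p = q;`
    }

That per-assignment cost is exactly what a mutating, pointer-heavy language
ends up paying for this style of collector.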

Parallel collectors, on the other hand, address throughput and come with no
such theoretical issues. Since parallel GCs can be "stop the world", that is
what I assumed you meant.

> The other approach is similar to a generational GC: you need to
> establish checkpoints, and know which allocations are performed before
> or after the checkpoint.
> At that point, after the mark phase, you can free the non-marked
> allocations that were done before the checkpoint.
> With multithreading, each local pool on the other hand needs to collect
> all the "retains" coming from other threads done before the checkpoint,
> and then free all the non-marked allocations.

This sounds like "very concurrent garbage collection", which is one of the
elegant but immutable-objects-only collectors. Generational collectors are a
form of escape analysis: when a new/young object reference is assigned to an
old object's member variable, it escapes to the older generation. To be
efficient, generational collectors need bump allocators (and a moving/copying
GC) or an "objects only" language, so one can efficiently edit a class's
metadata. D's use of interior pointers, structs and array slices, not to
mention C function calls, makes generational collectors difficult to implement
correctly, let alone efficiently.
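
As a concrete illustration (ordinary D, nothing exotic), a moving or
generational collector would have to find and fix up interior pointers like
the ones below, and a write barrier would have to recognise them on every
store:

    struct Node { int value; Node* next; }

    void main()
    {
        auto arr = new int[](100);
        int[] slice = arr[10 .. 20];   // a slice is an interior pointer into the block
        auto n = new Node;
        n.value = 42;
        int* field = &n.value;         // an interior pointer into a heap struct
        // A copying or generational GC would have to relocate `slice` and
        // `field` whenever it moves `arr` or `n`; once such pointers are
        // handed to C code, the collector can no longer see or update them.
        slice[0] = *field;
    }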

> This introduces a delay in the deallocation. Generational GCs then make
> another pass just for the newer generation, without caring to mark things
> in the "old" generation, but that is another story.
>
> The communication of the retains of the other threads, and the  
> establishing of checkpoints can be done either preemptively or upon  
> request (one can simply have the information stored in some way in the  
> pools, for example keeping an order in the allocations, and storing a  
> checkpoint on the stack, that suspends the thread if it wants to reduce  
> the stack before the mark phase on it is performed).
> The thread doing the GC can be one of the running threads (or even  
> several), or a thread dedicated to it.
>
> In a 64-bit system one can really do memory partitioning without fear
> of exhausting the memory space, thus making the question "which
> pool/thread owns this pointer" very fast to answer; with 32 bits it is
> more tricky.

I'd disagree: memory partitioning should be equally difficult on 32-bit and
64-bit platforms. Consider memory pages A, B and C. If thread 1 allocates some
memory and gets page A, and thread 2 then allocates some memory and gets page
B, when thread 1 next allocates memory it'll get page C and end up with a
discontinuous memory set (A, C).
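
A toy sketch of that interleaving (the single shared page allocator below is
hypothetical, standing in for whatever the runtime or OS hands pages out of):

    struct PageAllocator
    {
        size_t next;                        // next free page number
        size_t alloc() { return next++; }   // pages are handed out in request order
    }

    void main()
    {
        PageAllocator pages;
        auto t1a = pages.alloc();   // thread 1 gets page 0 ("A")
        auto t2a = pages.alloc();   // thread 2 gets page 1 ("B")
        auto t1b = pages.alloc();   // thread 1 gets page 2 ("C")
        assert(t1b != t1a + 1);     // thread 1's pages {0, 2} are discontinuous
    }

The pointer width never enters into it: the pages interleave because the
allocator is shared, not because addresses are scarce.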

> In general one would not expect heavy sharing between threads so for  
> example an explicit list of external pointers could be kept if smaller  
> than X.

Do you know of an example of this? Given the Java thread-local collection
papers, this doesn't seem to be done in practice. It also seems very
overhead-intensive and prone to races.

> Virtual memory is another issue, but one could assume that the program  
> is all in real memory (tackling that needs OS collaboration).
>
> This is a discussion *without* shared... using it should allow one to
> reduce the communication of pointers retained by other threads to shared
> objects, and make a fully local GC for the non-shared objects, without
> needing to synchronize with other threads.

Yes, this is called thread-local garbage collection (which I mentioned).

> This is an advantage, but using stack variables, scope variables, good
> escape analysis and automatic collection for what does not escape, a
> good concurrent GC,... the gain offered by shared is not so clear-cut to
> me, given its complexity.

First, full escape analysis is both difficult and slow in interpreted
languages and impossible in compiled languages such as D, so automatic
collection is out too.

Second, scope/stack variables are limited in use and size; for example, you
can't define trees or dynamic arrays on the stack, nor exceed the stack size
(typically 1 MB).
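
(A trivial sketch of the size limit, assuming the usual ~1 MB default thread
stack:)

    void main()
    {
        int[1024] small;                         // fine: small, fixed size, on the stack
        // int[2_000_000] big;                   // ~8 MB of locals would overflow a 1 MB stack
        int[] growable = new int[](1_000_000);   // large or growable data ends up on the GC heap
        growable ~= 42;                          // appending goes through the GC, not the stack
        small[0] = growable[0];
    }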

Third, thread-local mark-sweep GCs actually equal or outperform modern
concurrent-parallel-generational GCs (according to Apple), and that's in a
language which supports them. Further, concurrent collectors need to access
the mark bits of an object. In D this means taking the global GC lock on
_every_ reference assignment and traversing an O(log N) memory tree. Even in
Java, where the lack of interior pointers and an object-only paradigm allow
the use of two to three atomic instructions on every assignment, eliminating
this unneeded, excess overhead would be a huge win.

Fourth, Bearophile brought this up on the newsgroup: making every object also
a monitor is widely considered one of Java's biggest mistakes. Without shared,
D would have to make the same mistake.
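
To illustrate the monitor point: in D today any class object can be the target
of a synchronized statement, so every object header carries a monitor slot
whether the object is ever shared or not. A toy example:

    class Counter
    {
        private int count;

        void increment()
        {
            synchronized (this)   // any object may be locked, so every object
            {                     // must be able to carry a monitor, used or not
                ++count;
            }
        }
    }

    void main()
    {
        auto c = new Counter;     // thread-local, yet it still pays for the monitor slot
        c.increment();
    }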

> Page reuse between different local pools is indeed an issue. I would say
> that with 64 bits one can simply give empty pages back to the system if
> they are over a given threshold, whereas with 32 bits indeed some more
> tricks (stealing dirty pages) might be in order. Please note that
> stealing pages from another NUMA node is not something you want to do
> unless it is unavoidable.

I don't see what 64-bit vs 32-bit has to do with this: the former simply lets
you address more memory; it doesn't magically prevent page fragmentation,
false sharing, etc.


