[dmd-concurrency] D's Memory Model

Mon Feb 8 03:07:36 PST 2010

I was away a couple of days, and coming back I found your interesting  
post Robert Jacques.

I had not thought much about false sharing, thanks for having raised  
the issue.

In general both for false sharing and for the global GC lock (that is  
*really* a bottleneck for current heavily multithreaded code if you  
allocate in the threads) I fully agree that a having thread local (or  
cpu local) allocators is the correct way out.

Without shared you can have two approaches.

One is  the simplistic "stop the world" (but making the mark & co  
concurrent). The problem is the latency that is introduced by this  
approach, but if done correctly (well parallelized) the ratio GC time/ 
computation time should not be worse than other approaches.

The other approach is similar to the generational GC: you need to  
establish checkpoints, and know which allocations are performed before  
or after the checkpoint.
at that point, after the mark phase, you can free the non marked  
allocations that were done before the checkpoint.
With multithreading one local pool on the other hand needs to collect  
all the "retains" coming from other threads done before the  
checkpoint, and then free all the non marked allocation.

This introduces a delay in the deallocation. Generational GC then make  
another GC just for the newer generation, without caring to mark  
things in the "old" generation, but that is another story.

The communication of the retains of the other threads, and the  
establishing of checkpoints can be done either preemptively or upon  
request (one can simply have the information stored in some way in the  
pools, for example keeping an order in the allocations, and storing a  
checkpoint on the stack, that suspends the thread if it wants to  
reduce the stack before the mark phase on it is performed).
The thread doing the GC can be one of the running threads (or even  
several), or a thread dedicated to it.

In a 64 bit system one can really do a memory partitioning without  
fear of exhausting the memory space, thus making the question "which  
pool/thread owns this pointer" very fast to answer, with 32 bit is  
more tricky.
In general one would not expect heavy sharing between threads so for  
example an explicit list of external pointers could be kept if smaller  
than X.
Virtual memory is another issue, but one could assume that the program  
is all in real memory (tackling that needs OS collaboration).

This is a discussion *without* shared,... using it should allow one to  
reduce the communication of pointers retained by other threads to  
shared objects. and make a fully local gc for the non shared objects,  
without needing to synchronize with other threads.
This is an advantage, but using stack variables, scope variables, good  
escape analysis and automatic collection for what does not escape, a  
good concurrent gc,... the gain offered by shared is not so clear cut  
to me, seeing its complexity.

Page reuse between different local pools is indeed an issue, I would  
say that with 64 bits one can simply give empty pages back to the  
system if they are over a given threshold, whereas with 32 bit indeed  
some more tricks (stealing dirty pages) might be in order. Please note  
that stealing pages from another numa node is not something you want  
to do unless it is unavoidable

ciao
Fawzi