[dmd-concurrency] D's Memory Model
Fawzi Mohamed
fawzi at gmx.ch
Mon Feb 8 03:07:36 PST 2010
I was away a couple of days, and coming back I found your interesting
post Robert Jacques.
I had not thought much about false sharing, thanks for having raised
the issue.
In general both for false sharing and for the global GC lock (that is
*really* a bottleneck for current heavily multithreaded code if you
allocate in the threads) I fully agree that a having thread local (or
cpu local) allocators is the correct way out.
Without shared you can have two approaches.
One is the simplistic "stop the world" (but making the mark & co
concurrent). The problem is the latency that is introduced by this
approach, but if done correctly (well parallelized) the ratio GC time/
computation time should not be worse than other approaches.
The other approach is similar to the generational GC: you need to
establish checkpoints, and know which allocations are performed before
or after the checkpoint.
at that point, after the mark phase, you can free the non marked
allocations that were done before the checkpoint.
With multithreading one local pool on the other hand needs to collect
all the "retains" coming from other threads done before the
checkpoint, and then free all the non marked allocation.
This introduces a delay in the deallocation. Generational GC then make
another GC just for the newer generation, without caring to mark
things in the "old" generation, but that is another story.
The communication of the retains of the other threads, and the
establishing of checkpoints can be done either preemptively or upon
request (one can simply have the information stored in some way in the
pools, for example keeping an order in the allocations, and storing a
checkpoint on the stack, that suspends the thread if it wants to
reduce the stack before the mark phase on it is performed).
The thread doing the GC can be one of the running threads (or even
several), or a thread dedicated to it.
In a 64 bit system one can really do a memory partitioning without
fear of exhausting the memory space, thus making the question "which
pool/thread owns this pointer" very fast to answer, with 32 bit is
more tricky.
In general one would not expect heavy sharing between threads so for
example an explicit list of external pointers could be kept if smaller
than X.
Virtual memory is another issue, but one could assume that the program
is all in real memory (tackling that needs OS collaboration).
This is a discussion *without* shared,... using it should allow one to
reduce the communication of pointers retained by other threads to
shared objects. and make a fully local gc for the non shared objects,
without needing to synchronize with other threads.
This is an advantage, but using stack variables, scope variables, good
escape analysis and automatic collection for what does not escape, a
good concurrent gc,... the gain offered by shared is not so clear cut
to me, seeing its complexity.
Page reuse between different local pools is indeed an issue, I would
say that with 64 bits one can simply give empty pages back to the
system if they are over a given threshold, whereas with 32 bit indeed
some more tricks (stealing dirty pages) might be in order. Please note
that stealing pages from another numa node is not something you want
to do unless it is unavoidable
ciao
Fawzi
More information about the dmd-concurrency
mailing list