[dmd-concurrency] D's Memory Model

Fawzi Mohamed fawzi at gmx.ch
Tue Feb 9 02:12:29 PST 2010


On 8-feb-10, at 19:15, Robert Jacques wrote:

> On Mon, 08 Feb 2010 06:07:36 -0500, Fawzi Mohamed <fawzi at gmx.ch>  
> wrote:
>> I was away a couple of days, and coming back I found your  
>> interesting post Robert Jacques.
>>
>> I had not thought much about false sharing, thanks for having  
>> raised the issue.
>>
>> In general both for false sharing and for the global GC lock (that  
>> is *really* a bottleneck for current heavily multithreaded code if  
>> you allocate in the threads) I fully agree that a having thread  
>> local (or cpu local) allocators is the correct way out.
>>
>> Without shared you can have two approaches.
>
> I'm assuming you are talking the collection, not allocation phase.  
> The empirical tests I've seen indicate that currently the allocation  
> bottleneck dwarfs the collection bottleneck.

Yes, I fully agree; note "thread local (or cpu local) *allocators*".
The workings of a thread-local pool for allocation can be basically
the same as the current one: one can replicate the pools for "small"
objects. So I concentrated on how to handle the collection step.
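
To make it concrete, by replicating the pools I mean roughly something
like this (just a sketch: the size classes and the fallback to GC.malloc
are invented for the example, this is not the real druntime code):

import core.memory : GC;

// invented size classes for the example
static immutable size_t[] smallSizeClasses = [16, 32, 64, 128, 256];

struct ThreadLocalPool
{
    // one free list per size class; touched only by the owning thread
    void*[][] freeLists;

    void* allocate(size_t size)
    {
        if (freeLists.length == 0)
            freeLists = new void*[][](smallSizeClasses.length);

        foreach (i, cls; smallSizeClasses)
        {
            if (size <= cls)
            {
                if (freeLists[i].length)
                {
                    auto p = freeLists[i][$ - 1];
                    freeLists[i] = freeLists[i][0 .. $ - 1]; // pop, no lock needed
                    return p;
                }
                // refill path: only here do we fall back to the shared heap
                return GC.malloc(cls);
            }
        }
        // "large" requests keep going through the shared allocator
        return GC.malloc(size);
    }

    void deallocate(void* p, size_t size)
    {
        foreach (i, cls; smallSizeClasses)
            if (size <= cls) { freeLists[i] ~= p; return; }
        GC.free(p);
    }
}

// module-level variables are thread-local in D2, so this already gives
// one independent pool per thread, and the fast path needs no lock
ThreadLocalPool tlsPool;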

>> One is  the simplistic "stop the world" (but making the mark & co  
>> concurrent). The problem is the latency that is introduced by this  
>> approach, but if done correctly (well parallelized) the ratio GC  
>> time/computation time should not be worse than other approaches.
>
> Your statement is a bit confusing since concurrent collectors are  
> generally not considered "stop the world".

OK, I should have said that the mark phase & co. are parallel; I was
not implying that the resulting GC is what is called a concurrent GC.

> Concurrent collectors trade throughput (GC time / program time) for  
> program responsiveness. They also often trade program correctness  
> for latency, as they are notoriously hard to get correct for  
> languages with mutation. While there are some very elegant  
> concurrent GCs for functional languages, the high level descriptions  
> of the Boehm concurrent GC seems to be inherently racy, while Java  
> concurrent GCs tend to be highly instrumented and lock-tastic (mark- 
> bit locking on every ref assignment).
that can be done with atomic ops...

> As D is a multi-paradigm language which allows C function calls,  
> most of the existing concurrent collector algorithms don't seem to  
> be applicable to D. (Please feel free to correct me if you know of a  
> paper on a concurrent GC that would be good for D)
Well, moving GCs are out, but other kinds can be done (see later).
>
> Parallel collectors, on the other hand, address throughput and come  
> with no theoretical issues. As parallel GCs can be "stop the world",  
> this is what I assumed you meant.

indeed that is what I meant

>> The other approach is similar to the generational GC: you need to  
>> establish checkpoints, and know which allocations are performed  
>> before or after the checkpoint.
>> at that point, after the mark phase, you can free the non marked  
>> allocations that were done before the checkpoint.
>> With multithreading one local pool on the other hand needs to  
>> collect all the "retains" coming from other threads done before the  
>> checkpoint, and then free all the non marked allocation.
>
> This sounds like "very concurrent garbage collection" which is one  
> of the elegant but immutable-objects-only collectors.
mmh I fear you are right... it did seem a little too simple...  
one should always think a little more before posting...

> Generational collectors are a form of escape analysis: when a new/ 
> young object reference is assigned to an old object's member  
> variable it escapes to the older generation. To be efficient,  
> generational collectors need bump-allocators (and a moving/copying  
> GC) or an "objects only" language, so one can efficiently edit a  
> class's meta-data. D's use of interior pointers, structs and array  
> slices, not to mention C function calls, make generational  
> collectors difficult to implement correctly, let alone efficiently.
yes I was not proposing a generational collector.

>> This introduces a delay in the deallocation. Generational GC then  
>> make another GC just for the newer generation, without caring to  
>> mark things in the "old" generation, but that is another story.
>>
>> The communication of the retains from the other threads, and the  
>> establishing of checkpoints, can be done either preemptively or upon  
>> request (one can simply have the information stored in some way in  
>> the pools, for example by keeping an order in the allocations, and  
>> by storing a checkpoint on the stack that suspends the thread if it  
>> wants to unwind the stack before the mark phase on it has been  
>> performed).
>> The thread doing the GC can be one of the running threads (or even  
>> several), or a thread dedicated to it.
>>
>> In a 64 bit system one can really do a memory partitioning without  
>> fear of exhausting the memory space, thus making the question  
>> "which pool/thread owns this pointer" very fast to answer, with 32  
>> bit is more tricky.
>
> I'd disagree in that memory partitioning should be equally difficult  
> on either 32-bit or 64-bit platforms. Consider memory pages A B C.  
> If thread 1 allocates some memory and gets page A, then thread 2  
> allocates some memory and gets page B, when thread 1 next allocates  
> memory it'll get page C and have a discontinuous memory set (A C).
You can use mmap or similar to reserve big ranges of address space
that you might never actually use (commit), so yes, there is a
difference: on 32 bit you cannot afford to reserve large chunks of
address space just to make "which thread owns this memory" quick to
answer.
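
Something along these lines (a sketch only, POSIX and 64 bit only; the
64 GiB figure and the layout are invented for the example):

// reserve a huge per-thread address range up front (address space only,
// nothing committed), so "which thread owns this pointer" becomes a
// simple range check
import core.sys.posix.sys.mman; // mmap, mprotect, PROT_*, MAP_*

enum size_t reservePerThread = 64UL * 1024 * 1024 * 1024;

struct ThreadHeap
{
    ubyte* base;      // start of the reserved range
    size_t committed; // how much of it is actually backed by memory

    static ThreadHeap reserve()
    {
        ThreadHeap h;
        // PROT_NONE: reserve the addresses without committing any pages
        auto p = mmap(null, reservePerThread, PROT_NONE,
                      MAP_PRIVATE | MAP_ANON, -1, 0);
        assert(p != MAP_FAILED);
        h.base = cast(ubyte*) p;
        return h;
    }

    // commit more of the range only when the local pool really needs it
    void* grow(size_t bytes)
    {
        auto p = base + committed;
        mprotect(p, bytes, PROT_READ | PROT_WRITE);
        committed += bytes;
        return p;
    }

    // the cheap ownership test this reservation buys us
    bool owns(const(void)* p) const
    {
        return p >= base && p < base + reservePerThread;
    }
}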
>
>> In general one would not expect heavy sharing between threads so  
>> for example an explicit list of external pointers could be kept if  
>> smaller than X.
>
> Do you know an example of this? Given the Java thread-local  
> collection papers, this doesn't seem to be done in practice. Also,  
> it seems very overhead intensive and prone to races.

Well, not directly, but I know that for example
http://www.research.ibm.com/people/d/dfb/papers/Bacon03Pure.ps
keeps a list of changes on a per-thread basis. That example actually
keeps a reference count instead of a simple mark flag, to avoid having
to keep the "externally marked" list explicitly. Probably a better
approach if one goes in that direction.
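
Very roughly, the "per thread list" idea I have in mind looks like this
(a made-up sketch, not what Bacon et al. actually do; they log
reference-count changes, and all the names here are invented):

// per-thread buffer of cross-pool references, merged at the checkpoint
import core.sync.mutex : Mutex;

struct RetainBuffer
{
    void*[] externalRefs; // pointers into other threads' pools

    void record(void* p)
    {
        externalRefs ~= p; // no lock: only the owning thread writes here
    }
}

RetainBuffer myRetains;   // thread-local (module variables are TLS in D2)

__gshared Mutex mergeLock;
__gshared void*[] pendingExternalMarks;

shared static this() { mergeLock = new Mutex; }

// called on each thread when the collector establishes the checkpoint;
// the merged list is then treated as extra roots for the local pools
void flushRetains()
{
    mergeLock.lock();
    scope(exit) mergeLock.unlock();
    pendingExternalMarks ~= myRetains.externalRefs;
    myRetains.externalRefs.length = 0;
}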

>> Virtual memory is another issue, but one could assume that the  
>> program is all in real memory (tackling that needs OS collaboration).
>>
>> This is a discussion *without* shared... using it should allow one  
>> to reduce the communication of pointers retained by other threads  
>> to the shared objects, and to make a fully local GC for the  
>> non-shared objects, without needing to synchronize with other threads.
>
> Yes, this is called thread-local garbage collection (which I  
> mentioned).
>
>> This is an advantage, but using stack variables, scope variables,  
>> good escape analysis and automatic collection for what does not  
>> escape, a good concurrent gc,... the gain offered by shared is not  
>> so clear cut to me, seeing its complexity.
>
> First, full escape analysis is both difficult and slow in  
> interpreted languages and impossible in compiled languages such as  
> D. So automatic collection is out too.
> Second, scope/stack variables are limited in use and size; for  
> example, you can't define trees or dynamic arrays on the stack nor  
> exceed the stack size (typically 1MB)
Sure, those things alone don't solve all problems, but does what does
not fall into them really generate so much overhead, compared with
"normal" shared allocation pools and a parallel or concurrent GC, as
to make shared worthwhile?

> Third, thread-local mark-sweep GCs actually equal or outperform  
> modern concurrent-parallel-generational GCs (according to Apple)
I fully believe this; I don't contest that a purely local mark and
sweep is more efficient, but is it so much more efficient that it
justifies the extra complexity in the language?

> , and that's in a language which supports them. Further, concurrent  
> collectors need to access the mark bits of an object. In D this  
> means taking the global GC lock on _every_ reference assignment and  
> traversing a log N memory tree.
Setting the mark bits is a prime example of something that can be done
with an atomic op.
I haven't understood what you mean with "taking the global GC lock on
_every_ reference assignment and traversing a log N memory tree": you
can find the root pointer and check the mark bit without any lock, and
only when you think you need to set it do you need an atomic op
(during the mark phase you can just go unmarked->marked).
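
For example, during the mark phase something like this is enough (a
sketch; the mark-bitmap layout is invented for the example):

// lock-free mark-bit setting: unmarked->marked needs only a
// compare-and-swap, not the global GC lock
import core.atomic : atomicLoad, cas;

struct MarkBits
{
    shared(size_t)[] words; // one bit per allocation

    // returns true if this call was the one that marked the object,
    // i.e. the caller should push it on the scan stack
    bool mark(size_t index)
    {
        auto w = index / (size_t.sizeof * 8);
        auto bit = (cast(size_t) 1) << (index % (size_t.sizeof * 8));
        for (;;)
        {
            auto old = atomicLoad(words[w]);
            if (old & bit)
                return false; // already marked, nothing to do
            if (cas(&words[w], old, old | bit))
                return true;  // we set it
            // another thread changed the word, retry with the fresh value
        }
    }
}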
Do you mean the operations that a concurrent GC needs in order to
ensure that mutations don't make it "lose" objects that are still in
use but get modified while it is collecting?
So basically the advantage of introducing shared is that a concurrent
GC becomes cheaper and more feasible, because the overhead of tracking
mutation is paid only by shared objects?

> Even in Java, where the lack of interior pointers and an object-only  
> paradigm allows the use of two-three atomic instructions on every  
> assignment, eliminating this unneeded and excess overhead would be a  
> huge win.
I am not sure that I understand this excess overhead: if the objects
are local there shouldn't be many problems, and even for the shared
ones I don't expect a lot of problems.
mmh, on a second reading I think that you are probably again referring
to the overhead needed to make a concurrent GC feasible, and to ensure
that mutations don't "lose" an object.

> Fourth, Bearophile brought this up on the newsgroup; one of the  
> biggest mistakes of Java is considered to be that every object is  
> also a monitor. Without shared, D would have to make the same mistake.
Why? Are explicit locks not an option? I use mutexes quite often...
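
What I mean by explicit locks, concretely (nothing exotic, just
core.sync; the names are invented for the example):

// explicit locking: only data that is actually shared pays for a lock,
// there is no need for every object to carry a monitor
import core.sync.mutex : Mutex;

__gshared Mutex listLock;
__gshared int[] sharedList;

shared static this() { listLock = new Mutex; }

void addItem(int x)
{
    listLock.lock();
    scope(exit) listLock.unlock();   // or: synchronized (listLock) { ... }
    sharedList ~= x;
}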

>> Page reuse between different local pools is indeed an issue, I  
>> would say that with 64 bits one can simply give empty pages back to  
>> the system if they are over a given threshold, whereas with 32 bit  
>> indeed some more tricks (stealing dirty pages) might be in order.  
>> Please note that stealing pages from another NUMA node is not  
>> something you want to do unless it is unavoidable.
>
> I don't see what 64-bit vs 32-bit has to do with this; the former  
> simply lets you access more memory, it doesn't magically prevent  
> page fragmentation, false sharing, etc.
Yes, but I don't expect page fragmentation to be such a problem that
you need to steal dirty pages from other pools, unless memory is
really tight.
False sharing is not an issue if you don't steal pages.

