[dmd-concurrency] D's Memory Model

Fawzi Mohamed fawzi at gmx.ch
Wed Feb 10 03:59:38 PST 2010


On 10-feb-10, at 03:14, Robert Jacques wrote:

> On Tue, 09 Feb 2010 05:12:29 -0500, Fawzi Mohamed <fawzi at gmx.ch>  
> wrote:
>> On 8-feb-10, at 19:15, Robert Jacques wrote:
>>> On Mon, 08 Feb 2010 06:07:36 -0500, Fawzi Mohamed <fawzi at gmx.ch>  
>>> wrote:
>
> [snip]
>
>>> Concurrent collectors trade throughput (GC time / program time)
>>> for program responsiveness. They also often trade program
>>> correctness for latency, as they are notoriously hard to get
>>> correct for languages with mutation. While there are some very
>>> elegant concurrent GCs for functional languages, the high-level
>>> descriptions of the Boehm concurrent GC seem to be inherently
>>> racy, while Java concurrent GCs tend to be highly instrumented and
>>> lock-tastic (mark-bit locking on every ref assignment).
>> that can be done with atomic ops...
>
> It can only be done correctly using a DCAS (not dwCAS), so it isn't
> implementable on any mainstream chip. So you have to lock the mark
> bits on an object. This can be done as a spin-lock using atomic ops,
> but that doesn't change the number or frequency with which they have
> to be taken.

I might be missing something (and I don't know what DCAS is), but as
you always go in one direction (unmarked -> marked) I don't see why
something like the following (for 32 bit) wouldn't work:

void mark(uint[] bitfield, uint bit) {
	auto idx = bit / 32;             // word that holds this mark bit
	auto mask = 1u << (bit & 0x1F);  // select the bit within that word
	volatile auto oldV = bitfield[idx];
	while ((oldV & mask) == 0) {
		// atomicCAS(ref w, newV, expected) is assumed to return the
		// previous value of the word
		auto seen = atomicCAS(bitfield[idx], oldV | mask, oldV);
		if (seen == oldV)
			break;     // our CAS took effect
		oldV = seen;   // another thread changed the word, retry
	}
}

This can be extended to three states (white, gray, black) for a
concurrent GC, or to reference counting.
You get collisions only when two threads try to modify the same
memory word, and even then, since the update is so cheap to compute,
retrying should not be a big problem.
False sharing makes things worse, but even that should be something
that can be made acceptable by spreading the mark bits out a bit in
memory.
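
For instance, a minimal sketch of the white -> gray transition using
two bits per object; atomicCAS is the same hypothetical primitive as
above (assumed to return the previous value), and the names and 2-bit
packing are just illustrative:

enum : uint { WHITE = 0, GRAY = 1, BLACK = 2 }

// Shade a white object gray; returns false if it was already shaded.
bool shadeGray(uint[] colorfield, uint obj) {
	auto idx = obj / 16;            // 16 two-bit entries per 32-bit word
	auto shift = (obj & 0xF) * 2;   // position of this object's 2 bits
	for (;;) {
		auto oldV = colorfield[idx];
		if (((oldV >> shift) & 0x3) != WHITE)
			return false;           // another thread got there first
		auto newV = oldV | (GRAY << shift);
		if (atomicCAS(colorfield[idx], newV, oldV) == oldV)
			return true;
	}
}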

At the end of the mark phase you need a memory barrier in each worker
thread, and a semaphore (or some flag) to ensure that all changes
have global visibility, but that is it.
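
Something along these lines, as a minimal sketch assuming druntime's
core.atomic (the counter and function names are made up):

import core.atomic;

shared uint workersDone;  // hypothetical completion counter

void finishMarkPhase() {
	atomicFence();                   // publish this thread's mark-bit writes
	atomicOp!"+="(workersDone, 1);   // signal completion to the collector
	// the collector proceeds once workersDone equals the worker count
}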

>>> I'd disagree in that memory partitioning should be equally  
>>> difficult on either 32-bit or 64-bit platforms. Consider memory  
>>> pages A B C. If thread 1 allocates some memory and gets page A,  
>>> then thread 2 allocates some memory and gets page B, when thread 1  
>>> next allocates memory it'll get page C and have a discontinuous  
>>> memory set (A C).
>> you can use mmap or similar to reserve big ranges of memory that
>> you might never actually use (commit), so yes there is a
>> difference. On 32 bit you cannot afford reserving large chunks of
>> memory just to quickly know which thread owns that memory.
>
> Interesting, though the fact that this makes each pool finite would  
> appear to be a design flaw to me.

Well, you cannot have it both ways: if you know that you have a lot
of virtual address space, you can make the pools finite but with
large reservations. Large allocations should use another method
anyway, and if you really fill a pool you might think of using
several pools for the same thread...
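
To illustrate the reserve-then-commit idea, a hedged sketch using the
POSIX calls exposed by druntime (the pool size is arbitrary):

import core.sys.posix.sys.mman;

enum size_t PoolReserve = 256 * 1024 * 1024;  // reserve 256 MB of address space

// Reserve address space without committing physical pages.
void* reservePool() {
	return mmap(null, PoolReserve, PROT_NONE,
	            MAP_PRIVATE | MAP_ANON, -1, 0);
}

// Commit a range inside the reservation when it is actually needed.
void commit(void* base, size_t offset, size_t len) {
	mprotect(base + offset, len, PROT_READ | PROT_WRITE);
}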

>>>> In general one would not expect heavy sharing between threads so  
>>>> for example an explicit list of external pointers could be kept  
>>>> if smaller than X.
>>>
>>> Do you know an example of this? Given the Java thread-local  
>>> collection papers, this doesn't seem to be done in practice. Also,  
>>> it seems very overhead intensive and prone to races.
>>
>> well not directly, but I know that for example http://www.research.ibm.com/people/d/dfb/papers/Bacon03Pure.ps 
>>   keeps a list of changes on a per thread basis. That example  
>> actually keeps a reference count instead of a simple mark flag to  
>> avoid having to keep the "externally marked" list explicitly.  
>> Probably a better approach if one goes in that direction.
>
> Ahh, that sounds more reasonable and an interesting hybrid approach  
> to boot. Although, to the best of my knowledge, ref-counting GCs are  
> not 100% correct given the multi-paradigm nature of the D  
> programming language. (To say nothing of efficiency)

Reference counting simply allows you to immediately discard objects
that are not part of cycles, at the cost of keeping a counter instead
of a simple flag (marked/unmarked).
Cycle removal might be delayed, but otherwise correctness (or lack of
it) is the same as for other GCs: it depends only on the information
you have about pointers, and on the probability of treating as a
pointer something that is not one.
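
The counter update itself is again just an atomic op; a minimal
sketch assuming druntime's core.atomic (names invented, and a real
collector would still need the delayed cycle detection):

import core.atomic;

struct RC { shared uint count = 1; }

// Drop one reference; at zero the object can be reclaimed at once
// (objects in cycles are left to a delayed cycle collector).
void release(RC* rc, void delegate() reclaim) {
	if (atomicOp!"-="(rc.count, 1) == 0)
		reclaim();
}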

> [snip]
>
>>>> This is an advantage, but using stack variables, scope variables,
>>>> good escape analysis and automatic collection for what does not
>>>> escape, a good concurrent GC, ... the gain offered by shared is
>>>> not so clear-cut to me, given its complexity.
>>>
>>> First, full escape analysis is both difficult and slow in  
>>> interpreted languages and impossible in compiled languages such as  
>>> D. So automatic collection is out too.
>>> Second, scope/stack variables are limited in use and size; for  
>>> example, you can't define trees or dynamic arrays on the stack nor  
>>> exceed the stack size (typically 1MB)
>> sure, those things alone don't solve all problems, but does what
>> does not fall into them generate so much overhead wrt. "normal"
>> shared allocation pools and a parallel or concurrent GC as to make
>> shared worthwhile?
>
> I assume you're talking about programmer and not implementation  
> overhead? (Since the implementation overhead is definitely lower)  
> Apple seems to think so. Microsoft and Intel seem to think so too.  
> Pretty much everybody has basically agreed that the "normal" way has  
> failed and we need a different solution. Just that nobody's sure  
> what those solutions should be, so we're getting lots of different  
> answers: lib-dispatch, OpenCL, CUDA, TBB, PLINQ, Erlang, etc.

Mmh, I think several things are mixed together here, not only GC.
For the GC, as far as I know Apple, Microsoft and Intel don't have
purely local GCs; can you point to a reference for what you were
saying? Or maybe it was because my statement was badly formulated: I
meant several pools (see the plural there), with memory shared by
everybody.

>>> Third, thread-local mark-sweep GCs actually equal or outperform
>>> modern concurrent-parallel-generational GCs (according to Apple)
>> I fully believe this; I don't contest that purely local mark and
>> sweep is more efficient, but is it so much more efficient as to
>> justify the extra complexity in the language?

Does Apple really have a purely thread-local GC? I thought they had
something like a ref-counting GC that could be concurrent.

>>> , and that's in a language which supports them. Further,  
>>> concurrent collectors need to access the mark bits of an object.  
>>> In D this means taking the global GC lock on _every_ reference  
>>> assignment and traversing a log N memory tree.
>> setting the mark bits is a prime example of something that can be
>> done with an atomic op.
>> I haven't understood what you mean with "taking the global GC lock
>> on _every_ reference assignment and traversing a log N memory
>> tree": you can find the root pointer and check the mark bit without
>> a lock; only when you think you need to set it do you need an
>> atomic op (during the mark phase you can just go unmarked->marked).
>> You mean the operations that are needed by a concurrent GC to
>> ensure that modifications don't make it "lose" objects that are
>> still in use if they are modified while collecting?
>> So basically the advantage of introducing shared is that a
>> concurrent GC becomes cheaper and more feasible because the
>> overhead of the mutation is paid only by shared objects?
>
> Okay, the problem I think you're having is that you're thinking of  
> Java and not of a systems programming language like C, C++ or D. In  
> Java, all heap memory items are objects and there are no interior  
> pointers. So it is trivial to find the mark bits of a class via its  
> meta-data. You still can't update the mark bits atomically since you  
> need a DCAS to store the reference while making sure the mark bits  
> are still what you think they are. So you have to lock-update- 
> unlock. Anyways, in C/C++/D you have structs, interior pointers and  
> arrays, which all mean to find those pointer bits you have to  
> traverse the memory tree inside the GC. So you need to lock it. By  
> the way, the same thing goes for capacity, which is why the lookup- 
> cache was added to enhance array append performance.

I am confused again: DCAS, locking, ...
From my understanding, the structure with the mark bits is not
modified during the mark phase, only its contents are.
Having interior pointers simply means that you need to find the root
(the base of the allocation) before marking, but finding the root is
something that can be done without any extra locking (assuming again
that during the mark phase the GC structures are not moved, which
holds for every GC that I know of).
A shared cache for performance reasons, well yes, there you need
locking of some sort (note how preallocated pools make these
operations faster; in the normal case they are not terrible either,
but you need extra info to know the layout of the pool, as it isn't
fixed).
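
To show why the root lookup needs no lock, a hedged sketch over
power-of-two-aligned, fixed-object-size pools (the layout and names
are invented for the example; a real pool needs header handling):

enum size_t PoolSize = 1 << 20;  // hypothetical 1 MB pool alignment

struct Pool {
	size_t objSize;  // fixed object size for this pool
	// ... mark bits etc. would follow
}

// Map an interior pointer to the base of its object. Everything read
// here is immutable during the mark phase, so no lock is needed.
void* findRoot(void* p) {
	auto pool = cast(Pool*)(cast(size_t)p & ~(PoolSize - 1));
	auto off = (cast(size_t)p - cast(size_t)pool) % pool.objSize;
	return p - off;
}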

>>>> Page reuse between different local pools is indeed an issue, I  
>>>> would say that with 64 bits one can simply give empty pages back  
>>>> to the system if they are over a given threshold, whereas with 32  
>>>> bit indeed some more tricks (stealing dirty pages) might be in  
>>>> order. Please note that stealing pages from another numa node is  
>>>> not something you want to do unless it is unavoidable
>>>
>>> I don't see what 64-bit vs 32-bit has to do with this; the former  
>>> simply lets you access more memory, it doesn't magically prevent  
>>> page fragmentation, false sharing, etc.
>> Yes, but I don't expect page fragmentation to be such a problem
>> that you need to steal dirty pages from other pools, unless memory
>> is really tight.
>> False sharing is not an issue if you don't steal pages.
>
> False sharing of local objects isn't, but false sharing of shared/ 
> immutable objects still is.
Ok, fair point; so allocations of shared/immutable objects should be
padded if one wants to avoid false sharing.
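
For example (a sketch; the 64-byte line size is an assumption about
typical x86 hardware):

enum size_t CacheLine = 64;  // assumed cache-line size

// Round a shared/immutable allocation up to a cache-line multiple so
// that objects used by different threads never share a line.
size_t paddedSize(size_t requested) {
	return (requested + CacheLine - 1) & ~(CacheLine - 1);
}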

> And memory can be just as tight on a 64-bit system as a 32-bit  
> system; one just has the ability to see more of it.
Yes, but I have difficulty imagining a thread generating a lot of
dirty pages and never using them again, to the point that stealing
them is worthwhile.

