[dmd-concurrency] D's Memory Model

Wed Feb 10 12:59:36 PST 2010

On Wed, 10 Feb 2010 12:28:39 -0500, Fawzi Mohamed <fawzi at gmx.ch> wrote:
> On 10-feb-10, at 17:41, Sean Kelly wrote:
>> On Feb 10, 2010, at 3:59 AM, Fawzi Mohamed wrote:
>>> On 10-feb-10, at 03:14, Robert Jacques wrote:
>>>> On Tue, 09 Feb 2010 05:12:29 -0500, Fawzi Mohamed <fawzi at gmx.ch>  
>>>> wrote:
>>>>> On 8-feb-10, at 19:15, Robert Jacques wrote:
>>>>>> On Mon, 08 Feb 2010 06:07:36 -0500, Fawzi Mohamed <fawzi at gmx.ch>  
>>>>>> wrote:
>>>>
>>>> [snip]
>>>>
>>>>>> Concurrent collectors trade throughput (GC time / program time) for  
>>>>>> program responsiveness. They also often trade program correctness  
>>>>>> for latency, as they are notoriously hard to get correct for  
>>>>>> languages with mutation. While, there are some very elegance  
>>>>>> concurrent GCs for functional languages, the high level  
>>>>>> descriptions of the Boehm concurrent GC seems to be inherently  
>>>>>> racy, while Java concurrent GCs tend to be highly instrumented and  
>>>>>> lock-tastic (mark-bit locking on every ref assignment).
>>>>> that can be done with atomic ops...
>>>>
>>>> It can only be done correctly using a DCAS, (not dwCAS), so isn't  
>>>> implementable on any mainstream chip. So you have to lock the mark  
>>>> bits on an object. This can be done as a spin-lock using atomic ops,  
>>>> but that doesn't change the number or frequency they have to be taken.
>>>
>>> I might miss something (and I don't know what DCAS is) but as you  
>>> always go in one direction (unmarked -> marked) I don't see why  
>>> something like (for 32 bit)
>>>
>>> mark(uint[]bitfield,uint bit){
>>> 	auto idx=(bit+31)/32;
>>> 	auto mask=1>>(bit&0x1F);
>>> 	volatile auto oldV=bitfield[idx];
>>> 	while((oldV & mask)==0){
>>> 		auto oldV=atomicCAS(bitfield[idx],oldV | mask,oldV);
>>> 	}
>>> }
>>>
>>> this can be extended if you have 3 states (white,gray,black) for  
>>> concurrent GC or to refCounting.
>>> You have collisions only when two threads try to modify the same  
>>> memory location, and even then given that the function is so fast to  
>>> compute it should not be a big problem.
>>> False sharing makes things worse, but also that should not be  
>>> something that cannot be made acceptable distributing a bit the mark  
>>> bits in the memory.
>>>
>>> At the end of the mark phase you need a memory barrier in each working  
>>> thread, and a semaphore (or some flag) to ensure that all changes have  
>>> global visibility, but that is it.
>>
>> This works in a world where all reference updates are done via in- 
>> language operations.  But D supports both extern (C) functions and  
>> inline assembly, both of which defeat incremental collection.
>
> this works with a stop the world approach as well as now (and is  
> parallel).
>
> I agree that to use that in a concurrent GC then you need some way to  
> ensure that an update to a reference does not loose references.
> That is possible only if you have some control on the references  
> updates, so either that has to be exposed to external programs, or you  
> declare programs that modify references transferring the reference to an  
> object from a memory location to another (simply loosing a reference or  
> adding one is not a problem, only the combination is), as unsafe to be  
> run during a GC mark phase if you use a concurrent GC and do not notify  
> it of the change.
> Not the ideal solution, but that is all you can do: ether you stop the  
> threads, or some control on the modifications has to be done...

The DCAS, or double compare and swap, does a compare and swap at two  
different memory locations simultaneously. The problem is that in a  
concurrent/incremental GC, you have to propagate the mark bits correctly.  
And to do this you have to atomically set the mark-bits and the assigned  
reference at the same time...although in writing this I think I figured  
out at least two ways of doing this correctly without a DCAS. One involves  
recursively propagating the mark-bits yourself, which could cause the  
thread to pause for a long time and retain excess memory, or adding the  
new node to a to-mark list, which would add a some serialization.