[dmd-concurrency] draft 7

Fawzi Mohamed fawzi at gmx.ch
Tue Feb 2 06:04:53 PST 2010


On 2-feb-10, at 07:47, Sean Kelly wrote:

> On Feb 1, 2010, at 10:24 PM, Andrei Alexandrescu wrote:
>>
>> If you or I don't know of any machine that would keep on reading a  
>> datum from its own cache instead of consulting the main memory,  
>> we're making an assumption. The right thing to do is to keep the  
>> compiler _AND_ the machine in the loop by using appropriate fencing.
>
> I did some googling before my last reply to try and find a NUMA  
> architecture that works this way, but I didn't have any luck.  Cache  
> coherency is just too popular these days.  But it seems completely  
> reasonable to include such an assertion in a standard as broadly  
> applicable as POSIX.  I'm not sure I agree that fencing is always  
> necessary though--we're really operating at the level of  
> "assumptions regarding hardware behavior" here.

My imaginary hardware model is the following:

Several processors, each with a separate cache.
Operations on a cache are kept in a kind of journal and communicated
to the other processors.
A processor continuously updates its cache and sends its updates to
the other processors (it makes no sense to skip this; if you do,
then you don't have a shared memory system).
Different processors might see the work of other processors delayed
or out of order; in fact, for performance reasons, that is most
likely the case.

* barriers
A write barrier ensures that all writes done on the cache of processor  
X (where the barrier was issued) are communicated to the caches of  
other processors *before* any subsequent write.
A read barrier ensures that all reads done on processor Y (where the
barrier was issued) before the barrier are completed before any read
after the barrier.
Using them one can ensure that if a read on Y before its read barrier
sees a change that X made after its write barrier, then all changes
made before the write barrier on X are visible to reads after the
read barrier on Y.
Thus barriers can introduce a weak ordering.
That is all they do: they don't synchronize, and they don't ensure
that a change becomes visible. With time all changes become visible
anyhow; otherwise you are not working on a shared memory machine
(NUMA or not).
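
To make the pairing concrete, here is a minimal sketch in D of
publishing a value through a flag. I use the names from D2's
core.atomic (atomicLoad/atomicStore/atomicFence) purely for
illustration; take the exact API as an assumption (the tango/blip
names differ a bit), the idea is the write barrier on X and the
read barrier on Y:

import core.atomic;
import core.thread;

shared int data;   // payload written on "processor X"
shared int flag;   // publication flag

void writer()      // runs on X
{
    atomicStore!(MemoryOrder.raw)(data, 42); // write the payload
    atomicFence();                           // write barrier: data is
                                             // communicated before flag
    atomicStore!(MemoryOrder.raw)(flag, 1);
}

void reader()      // runs on Y
{
    while (atomicLoad!(MemoryOrder.raw)(flag) == 0)
        Thread.yield();                      // wait until Y sees the flag
    atomicFence();                           // read barrier: later reads see
                                             // everything done before X's
                                             // write barrier
    assert(atomicLoad!(MemoryOrder.raw)(data) == 42);
}

void main()
{
    auto w = new Thread(&writer);
    auto r = new Thread(&reader);
    w.start(); r.start();
    w.join();  r.join();
}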

* atomic load/stores
Atomic loads and stores don't really change much, but they ensure
that a change happens in one indivisible step. Their cost (if the
hardware supports them) is typically very small, and often, if they
are supported for a given size, they are always used (64-bit values
on 32-bit processors are an exception).
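
A tiny example of why that matters (again using the D2 core.atomic
names, as an assumption): on a 32-bit processor a plain 64-bit store
may be split into two 32-bit writes, so another thread could observe
a half-written value; the atomic versions exclude that:

import core.atomic;

shared long counter;

void main()
{
    atomicStore(counter, 0x1_0000_0001L); // done as one indivisible step
    long v = atomicLoad(counter);         // another thread doing this can
                                          // never see a torn, half-written
                                          // value
    assert(v == 0x1_0000_0001L);
}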

* atomic update
This is something much stronger: basically it ensures that an update
is immediately globally seen. To keep its cost reasonable it concerns
just one memory location (some work has to be done so that the update
is perceived as immediate).
An atomic update a priori does not imply any kind of barrier for
other memory; only the updated memory location is synchronized.
Used in combination with barriers, atomic operations can be used to
implement locks, unique counters and the like, as in the sketch below.
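
For example, a spinlock can be sketched like this in D (again
assuming the D2 core.atomic names): the cas is the atomic update,
and the fences are the barriers around the critical section:

import core.atomic;
import core.thread;

shared int lockWord;        // 0 = free, 1 = held
__gshared int sharedCount;  // protected by the lock (not thread-local)

void lock()
{
    while (!cas(&lockWord, 0, 1)) // atomic update: only one thread
        Thread.yield();           // can flip 0 -> 1
    atomicFence(); // read barrier: see the writes of the previous holder
}

void unlock()
{
    atomicFence(); // write barrier: publish our writes before releasing
    atomicStore!(MemoryOrder.raw)(lockWord, 0);
}

void increment()
{
    lock();
    ++sharedCount; // a plain update is safe inside the critical section
    unlock();
}

void main()
{
    auto ts = new Thread[4];
    foreach (ref t; ts) { t = new Thread(&increment); t.start(); }
    foreach (t; ts) t.join();
    assert(sharedCount == 4);
}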

That is about it. You will not find this hardware definition written
down anywhere, but it is my conceptual model, and I think it can be
applied to all the hardware that I know of and, I believe, to future
hardware too.

You can see that barriers don't imply immediate synchronization
(unlike atomic updates) but order *all* changes done. Itanium was
toying with the idea of more local barriers that synchronize only
part of the memory, but that is difficult to program for (the
compiler could probably use them in some cases), so I have ignored it.

Anyway, this is my understanding of the situation, and I think it is
pretty good.

By the way, I found the atomic module in tango difficult to use
correctly (maybe I had not understood it), and I rewrote it.
Due to the recent reorganization in tango an older (buggy) version
was resurrected, and I don't know what will happen now; probably the
new one will be merged in. If you are interested, the latest version
is in blip:
	http://github.com/fawzi/blip/blob/master/blip/sync/Atomic.d
It is Apache 2.0, but I am willing to relicense it to whatever is
deemed useful.

As a note, I would like to say that atomic updates, while useful for
implementing some operations, should in my opinion not be the method
of choice for parallel programming (they are difficult to compose).
Still, in D they should definitely be usable by those who want them
(i.e. I want a working volatile... at least in D1.0 ;).

Fawzi

