[dmd-concurrency] draft 7

Sean Kelly sean at invisibleduck.org
Tue Feb 2 09:50:27 PST 2010


On Feb 2, 2010, at 6:04 AM, Fawzi Mohamed wrote:
> 
> My imaginary hardware model is the following:
> 
> several processors, each have a separate cache.
> Operations on the cache are kept in a kind of journal and communicated to other processors.
> A processor continuously updates its cache and sends its updates to the other processors (for me it makes no sense to skip this; if you do it, then you don't have a shared memory system).
> It might be that different processors see work of other processors delayed or out of order, actually most likely it is so (for performance reasons).

The weird part is that a processor may see the work of another processor out of order because of its own load reordering rather than because the stores were issued out of order.  I think this is why Bartosz has said that you need a barrier at both the read and write locations, and I guess this is the handshake Andrei mentioned.

> * barriers
> A write barrier ensures that all writes done on the cache of processor X (where the barrier was issued) are communicated to the caches of other processors *before* any subsequent write.
> A read barrier ensures that all reads done on processor Y (where the barrier was issued) before the barrier are completed before any read after the barrier.

I'd call these a hoist-store barrier and a hoist-load barrier (Alex Terekhov's terminology).  SPARC would call them a StoreStore and a LoadLoad barrier, I believe.  I can never keep the SPARC terminology straight because each word in the name is an operation class rather than a description of the barrier itself: the first word is the class of operations before the barrier, the second is the class after it, and the barrier prevents the first from being reordered past the second.  Saying that a load or store has acquire semantics is a stronger guarantee because it constrains the movement of both loads and stores.

> * atomic load/stores
> Atomic loads or stores don't really change much, but they ensure that a change is done all at once.  Their cost (if the hardware supports them) is typically very small, and often, when atomicity is supported for a given size, it is simply used unconditionally (64-bit on 32-bit processors is an exception).

I've always considered atomic operations to only guarantee that the operation happens as an indivisible unit.  It may still be delayed, and perhaps longer than an op with a barrier since the CPU could reorder operations to occur before it in some cases.  For example, a MOV instruction on x86 is atomic but there's no barrier.

> By the way I found the atomic module in tango difficult to use correctly (maybe I had not understood it), and I rewrote it.

What problems did you have?

