[dmd-concurrency] word tearing status in today's processors

Wed Jan 27 08:47:46 PST 2010

On Wed, 27 Jan 2010 10:10:49 -0500, Andrei Alexandrescu  
<andrei at erdani.com> wrote:

> Hello,
>
>
> I'm looking for _hard data_ on how today's processors address word tearing.  
> As usual, googling for word tearing yields the usual mix of vague  
> information, folklore, and opinionated newsgroup discussions.
>
> In particular:
>
> a) Can we assume that all or most of today's processors are able to  
> write memory at byte level?

Not sure, but both x86 and ARM have byte-store instructions (MOV with an  
8-bit operand on x86, STRB on ARM).

> b) If not, is it reasonable to have the compiler insert for sub-word  
> shared assignments a call to a function that avoids word tearing by  
> means of a CAS loop?

Yes, in general, though on x86 a plain xchg should be used instead of a CAS  
loop: xchg with a memory operand is implicitly locked and accepts byte  
operands.
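
For a CPU that cannot store a single byte, the CAS loop from (b) could look  
roughly like the sketch below. It assumes C11 <stdatomic.h> and a 32-bit  
containing word; store_byte_cas and its byte_index convention (0 is the  
least significant byte) are made up for illustration, not something any  
compiler is known to emit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: replace one byte of a shared 32-bit word while
   leaving the other three bytes exactly as another thread last wrote
   them. byte_index counts from the least significant byte (0..3). */
void store_byte_cas(_Atomic uint32_t *word, unsigned byte_index, uint8_t value)
{
    assert(word != NULL);
    assert(byte_index < 4);

    unsigned shift = byte_index * 8;
    uint32_t mask = (uint32_t)0xFF << shift;

    uint32_t old = atomic_load(word);
    uint32_t desired;
    do {
        /* Splice the new byte into the most recently observed word. */
        desired = (old & ~mask) | ((uint32_t)value << shift);
        /* On failure, old is refreshed with the current contents,
           so the next iteration splices into up-to-date neighbours. */
    } while (!atomic_compare_exchange_weak(word, &old, desired));
}
```

The weak compare-exchange is allowed to fail spuriously, which is why it  
sits in a loop. That matches how an LL/SC pair behaves on the non-x86  
CPUs.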

> c) For 64-bit data (long and double), am I right in assuming that all  
> non-ancient Intel32 processors do offer a means to atomically assign  
> 64-bit data? (What are those asm instructions?) For processors that  
> don't (Intel or not), can we/should we guarantee at the language level  
> that 64-bit writes are atomic? We could effect that by using e.g. a  
> federation of hashed locks, or even (gasp!) two global locks, one for  
> long and one for double, and do something cleverer when public outrage  
> puts our lives in danger. Java guarantees atomic assignment for volatile  
> data, but I'm not sure what mechanisms implementations use.

The instruction you're looking for is CMPXCHG8B for 32-bit x86 CPUs. It's  
been around since the original Pentium (the 486 only added the 32-bit  
CMPXCHG). Other CPUs generally use a load-linked/store-conditional (LL/SC)  
pair. From Wikipedia:
All of Alpha, PowerPC, MIPS, and ARM have LL/SC instructions: ldl_l/stl_c  
and ldq_l/stq_c (Alpha), lwarx/stwcx (PowerPC), ll/sc (MIPS), and  
ldrex/strex (ARM version 6 and above).

Most platforms provide multiple sets of instructions for different data  
sizes, e.g. ldarx/stdcx for doubleword on the PowerPC.
Some CPUs require the address being accessed exclusively to be configured  
in write-through mode.
Some CPUs track the load-linked address at a cache-line or other  
granularity, such that any modification to any portion of the cache line  
(whether via another core's store-conditional or merely by an ordinary  
store) is sufficient to cause the store-conditional to fail.
All of these platforms provide weak LL/SC. The PowerPC implementation is  
the strongest, allowing an LL/SC pair to wrap loads and even stores to  
other cache lines. This allows it to implement, for example, lock-free  
reference counting in the face of changing object graphs with arbitrary  
counter reuse (which otherwise requires DCAS).

And from an ARM website (STREXD is 64-bit):
ARM LDREX and STREX are available in ARMv6 and above.
ARM LDREXB, LDREXH, LDREXD, STREXB, STREXD, and STREXH are available in  
ARMv6K and above.
All these 32-bit Thumb instructions are available in ARMv6T2 and above,  
except that LDREXD and STREXD are not available in the ARMv7-M profile.

ARM has also had a swap-byte instruction (SWPB) since v4, which may or may  
not be equivalent to LDREXB/STREXB.

So I think it's safe to say that 64-bit writes will be efficient on most  
CPUs out there and making a language level guarantee is okay.
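
To make the 64-bit case concrete, here is a minimal sketch of an atomic  
64-bit assignment built only from compare-and-swap, which is roughly what  
a CMPXCHG8B loop or an LL/SC pair gives you. It assumes C11  
<stdatomic.h>; store64_cas is a hypothetical name, and a real compiler  
may well pick a cheaper instruction sequence where one exists.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: atomically assign a 64-bit value using only
   compare-and-swap, the way a CMPXCHG8B-based store would. */
void store64_cas(_Atomic uint64_t *target, uint64_t value)
{
    assert(target != NULL);

    /* Any initial guess works: a failed compare-exchange writes the
       current contents into expected in one atomic step, so no
       separate (possibly torn) 64-bit read is ever needed. */
    uint64_t expected = 0;
    while (!atomic_compare_exchange_weak(target, &expected, value)) {
        /* expected now holds what is really in memory; retry. */
    }
}
```

atomic_is_lock_free(target) reports whether the platform does this  
without falling back to a lock.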

Warning: most of this came from some quick Google searches, so I don't  
know if there are other gotchas out there.

>
> Thanks,
>
> Andrei