[dmd-concurrency] word tearing status in today's processors

Kevin Bealer kevinbealer at gmail.com
Wed Jan 27 08:40:49 PST 2010


On Wed, Jan 27, 2010 at 10:10 AM, Andrei Alexandrescu <andrei at erdani.com> wrote:

> Hello,
>
>
> I'm looking for _hard data_ on how today's processors address word tearing. As
> usual, googling for word tearing yields the usual mix of vague information,
> folklore, and opinionated newsgroup discussions.
>
> In particular:
>
> a) Can we assume that all or most of today's processors are able to write
> memory at byte level?
>
> b) If not, is it reasonable to have the compiler insert for sub-word shared
> assignments a call to a function that avoids word tearing by means of a CAS
> loop?
>
> c) For 64-bit data (long and double), am I right in assuming that all
> non-ancient Intel32 processors do offer a means to atomically assign 64-bit
> data? (What are those asm instructions?) For processors that don't (Intel or
> not), can we/should we guarantee at the language level that 64-bit writes
> are atomic? We could effect that by using e.g. a federation of hashed locks,
> or even (gasp!) two global locks, one for long and one for double, and do
> something cleverer when public outrage puts our lives in danger. Java
> guarantees atomic assignment for volatile data, but I'm not sure what
> mechanisms implementations use.
>
>
> Thanks,
>
> Andrei
> _______________________________________________
> dmd-concurrency mailing list
> dmd-concurrency at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/dmd-concurrency
>

I've been thinking about this a little.

I think a basic hashed locking scheme like the one you mention is a good
idea, especially if you can come up with one that can deal sensibly with
different object sizes.  Using spin locks, you could do almost anything
with a few CAS operations.
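To make the CAS idea concrete, here is a minimal sketch (in C++, with an illustrative helper name of my own) of the kind of thing Andrei's question (b) asks about: writing a single byte inside a shared word via a CAS retry loop, so neighboring bytes are never torn. It assumes all access to the word goes through the atomic.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: store one byte inside a shared 32-bit word without
// disturbing its neighbors. Assumes the word is 4-byte aligned and that
// every writer uses this same CAS path.
void store_byte_no_tear(std::atomic<uint32_t>& word,
                        unsigned byte_index, uint8_t value) {
    assert(byte_index < 4);
    const unsigned shift = byte_index * 8;
    const uint32_t mask = 0xFFu << shift;
    uint32_t old = word.load(std::memory_order_relaxed);
    uint32_t desired;
    do {
        // Splice the new byte into a copy of the current word...
        desired = (old & ~mask) | (uint32_t(value) << shift);
        // ...and publish it only if no one else changed the word meanwhile.
    } while (!word.compare_exchange_weak(old, desired,
                                         std::memory_order_release,
                                         std::memory_order_relaxed));
}
```

A concurrent writer to a different byte of the same word simply retries until its CAS succeeds, so both updates land without either being lost.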

I like the idea of a tiered approach as a starting point:

A. Make a list of operations D would like to do with shared objects, fields,
etc.
B. Create a basic implementation of those ops using the global locking
scheme.
C. Anything the processor can help you do better can be considered an
optimization for that platform.

I'm sure some people would see this as too complex, but this approach
separates language capability from platform capability.  It also allows
reasoning about, and performance comparison of, tier B, which different
vendors / D compilers could implement differently.

Tier A, the shared operations, could include byte and int64 operations, but
also anything else you'd like to do with a shared object, such as modifying
an associative array's contents or swapping the contents of two adjacent
fields of the same object.  Maybe this should be limited to things
that might conceivably be possible using atomic CPU operations, such as
atomically modifying two adjacent fields together (I'm thinking of array
assignment, though it's probably better for tier A to be conceived in terms
of what physical operation is being done), but probably not something like
sorting an array.
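As one example of a tier A operation that plausibly maps onto a single atomic instruction, two adjacent 32-bit fields can be packed into one 64-bit atomic and updated together with one CAS. This is a sketch under that assumption; the struct and names are illustrative, not from the post.

```cpp
#include <atomic>
#include <cstdint>

// Two adjacent 32-bit fields packed so they can be updated together with
// a single 64-bit CAS (illustrative names).
struct AdjacentPair {
    std::atomic<uint64_t> both{0};  // low 32 bits = a, high 32 bits = b

    void set_both(uint32_t a, uint32_t b) {
        both.store((uint64_t(b) << 32) | a, std::memory_order_release);
    }

    // Atomically exchange the two halves: one "two adjacent fields"
    // operation done with a single wide CAS rather than two locks.
    void swap_fields() {
        uint64_t old = both.load(std::memory_order_relaxed);
        uint64_t desired;
        do {
            desired = (old << 32) | (old >> 32);
        } while (!both.compare_exchange_weak(old, desired));
    }
};
```

Anything wider than the largest CAS the platform offers would fall back to the tier B locks, which is exactly the layering argued for above.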

Tier B implements these ops using a simple two-layer locking scheme to lock
memory areas.  Two layers of locks should be enough to handle variable-sized
objects if the spinlocks have both a shared and an exclusive locking mode.
Use layer 1 to lock disk-page-sized regions and layer 2 to lock 8- or 16-
byte regions.  To lock anything larger than 8/16 bytes, take the layer 1
lock in exclusive mode.  For anything 8/16 bytes or smaller, take the layer
1 lock in shared mode and the layer 2 lock in exclusive mode.
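The two-layer scheme above can be sketched like this: a small reader/writer spinlock, hashed twice at two granularities. The table sizes, the 4 KB/16-byte granularities, and all names are illustrative assumptions, not a definitive design.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Minimal shared/exclusive spinlock: state < 0 means a writer holds it,
// state > 0 counts readers, 0 means free.
struct SharedSpinLock {
    std::atomic<int> state{0};
    void lock_shared() {
        int s;
        do { s = state.load(std::memory_order_relaxed); }
        while (s < 0 || !state.compare_exchange_weak(s, s + 1));
    }
    void unlock_shared()    { state.fetch_sub(1); }
    void lock_exclusive() {
        int expected = 0;
        while (!state.compare_exchange_weak(expected, -1)) expected = 0;
    }
    void unlock_exclusive() { state.store(0); }
};

// Hashed federation of locks in two layers: layer 1 covers disk-page-sized
// (4 KB) regions, layer 2 covers 16-byte regions.
struct TwoLayerLocks {
    static const size_t N = 256;  // arbitrary table size
    SharedSpinLock layer1[N];
    SharedSpinLock layer2[N];

    SharedSpinLock& l1(uintptr_t addr) { return layer1[(addr >> 12) % N]; }
    SharedSpinLock& l2(uintptr_t addr) { return layer2[(addr >> 4) % N]; }

    // Small object (<= 16 bytes): layer 1 shared, layer 2 exclusive.
    void lock_small(uintptr_t a)   { l1(a).lock_shared(); l2(a).lock_exclusive(); }
    void unlock_small(uintptr_t a) { l2(a).unlock_exclusive(); l1(a).unlock_shared(); }

    // Large object: layer 1 exclusive alone, which excludes all small
    // lockers in that page-sized region.
    void lock_large(uintptr_t a)   { l1(a).lock_exclusive(); }
    void unlock_large(uintptr_t a) { l1(a).unlock_exclusive(); }
};
```

The point of the shared mode on layer 1 is that many small-object writers in the same page proceed in parallel, while a large-object writer taking layer 1 exclusively automatically waits for all of them.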

Tier C is a cookbook of tricks for modifying things on particular
platforms; e.g., if you know that byte-wise writing or int64 assignment is
possible without tearing, you can define replacements for the generic ops
on that platform.
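For the int64 case this tier C replacement is essentially free on x86: since the Pentium, IA-32 has CMPXCHG8B, and an aligned 8-byte FILD/FISTP or SSE MOVQ also gives a tear-free 64-bit load/store (these are the instructions Andrei's question (c) asks about); on x86-64 a plain aligned MOV is already atomic. A C++11 std::atomic<uint64_t> compiles down to exactly those, so the "replacement op" is just a sketch like this (helper names are mine):

```cpp
#include <atomic>
#include <cstdint>

// Tier C replacement for shared 64-bit assignment on platforms where it is
// natively atomic (CMPXCHG8B / aligned 8-byte moves on IA-32, plain MOV on
// x86-64). Whether a given target really is lock-free can be probed with
// std::atomic<uint64_t>::is_lock_free(); where it isn't, the generic
// lock-based tier B op stays in place.
inline void shared_store64(std::atomic<uint64_t>& dst, uint64_t value) {
    dst.store(value, std::memory_order_release);  // no word tearing
}

inline uint64_t shared_load64(const std::atomic<uint64_t>& src) {
    return src.load(std::memory_order_acquire);
}
```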

Kevin

