[dmd-concurrency] real

Kevin Bealer kevinbealer at gmail.com
Thu Jan 28 20:21:25 PST 2010


On Thu, Jan 28, 2010 at 2:52 PM, Walter Bright <walter at digitalmars.com>wrote:

> I don't think the message D should send is: "Yes, I know you're using a
> machine that can do atomic 128 bits, but because some machines you don't
> have and don't care about don't, you have to go the long way around the
> block and use a clumsy, awkward, inefficient workaround."
>
> The choices are:
>
> 1. allow atomic access for basic types where the CPU supports it, issue
> error when it does not. The compiler makes it clear where the programmer
> needs to pay attention on machines that need it, and so the programmer can
> select whether to use a (slow) mutex or to redesign the algorithm.
>
> 2. allow atomic access only for basic types supported by the lowest common
> denominator CPU. This requires the user to use workarounds even on machines
> that support native operations. It's like the old days where programmers
> were forced to suffer with emulated floating point even if they'd spent the
> $$$ for a floating point coprocessor.
>
> 3. allow atomic access for all basic types, emit mutexes for those types
> where the CPU does not allow atomic access. Keep in mind that mutexes can
> make access 100 times slower or more. Bartosz suggested to me that silently
> inserting such mutexes is a bad idea because the programmer would likely
> prefer to redesign the code than accept such a tremendous slowdown, except
> that the compiler hides such and makes it hard for him to find.
>
>
> As I've said before, I prefer (1) because D is a systems programming
> language. It's mission is not the Java "compile once, run everywhere." D
> should cater to the people who want to get the most out of their machines,
> not the least. For example, someone writing a device driver *needs* to get
> all the performance possible. Having some operations be 100x slower just
> because of portability is not acceptable.
>

I like this analysis in principle, but option #3 includes a factor - 100x
slower - that I wonder about: has it actually been measured?  I'll grant that
full pthreads-style mutexes, which are function calls with a lot of overhead
and logic built in, not to mention system calls in some cases, are pretty
darn slow.  But once we assume that atomics require a memory barrier of some
kind on read, and that a simple spinlock is good enough for a mutex, I wonder
if the gap is really that large.  Contrast these two designs to implement
"shared real x; x = x + 1;"

No magic:

   <memory barrier>
   CAS loop to do x = x + 1
   <memory barrier>

Versus emulated:

   <memory barrier>
   register size_t sl_index = size_t(&x) & 0xFF;
   CAS loop to set _spinlock_[sl_index] from 0 to 1
   x = x + 1 // no CAS needed here this time
   _spinlock_[sl_index] = 0 // no CAS needed to unlock
   <memory barrier>

I assume some of these memory barriers are not needed, but is the second
design really 100x slower?  I'd think the CAS is the slowest part, followed
by the memory barrier, and the rest is fairly minor, right?  The sl_index
calculation should be cheap, since &x must be in a register already.

Kevin
