<div class="gmail_quote">On Thu, Jan 28, 2010 at 2:52 PM, Walter Bright <span dir="ltr">&lt;<a href="mailto:walter@digitalmars.com">walter@digitalmars.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


<div bgcolor="#ffffff" text="#000000">I don&#39;t think the message D should send is: &quot;Yes, I know you&#39;re using a

machine that can do atomic 128 bits, but because some machines you

don&#39;t have and don&#39;t care about don&#39;t, you have to go the long way

around the block and use a clumsy, awkward, inefficient workaround.&quot;<br>

<br>

The choices are:<br>

<br>

1. allow atomic access for basic types where the CPU supports it, issue

error when it does not. The compiler makes it clear where the

programmer needs to pay attention on machines that need it, and so the

programmer can select whether to use a (slow) mutex or to redesign the

algorithm.<br>

<br>

2. allow atomic access only for basic types supported by the lowest

common denominator CPU. This requires the user to use workarounds even

on machines that support native operations. It&#39;s like the old days

where programmers were forced to suffer with emulated floating point

even if they&#39;d spent the $$$ for a floating point coprocessor.<br>

<br>

3. allow atomic access for all basic types, emit mutexes for those

types where the CPU does not allow atomic access. Keep in mind that

mutexes can make access 100 times slower or more. Bartosz suggested to

me that silently inserting such mutexes is a bad idea because the

programmer would likely prefer to redesign the code than accept such a

tremendous slowdown, except that the compiler hides such and makes it

hard for him to find.<br>

<br>

<br>

As I&#39;ve said before, I prefer (1) because D is a systems programming

language. Its mission is not the Java &quot;compile once, run everywhere.&quot;

D should cater to the people who want to get the most out of their

machines, not the least. For example, someone writing a device driver <i>needs</i>

to get all the performance possible. Having some operations be 100x

slower just because of portability is not acceptable.<br></div></blockquote></div><br>I like this analysis in principle, but option #3 rests on one figure, 100x slower. Has that really been measured?  I&#39;ll grant that full pthreads-style mutexes, which are function calls with a lot of overhead and logic built in, not to mention system calls in some cases, are pretty darn slow.  But once we assume that atomics require a memory barrier of some kind on read, and also that a simple spinlock is good enough for a mutex, I wonder whether the gap is really that large.  Contrast these two designs for implementing &quot;shared real x; x = x + 1;&quot;:<br>

<br>No magic:<br><br>   &lt;memory barrier&gt;<br>   CAS loop to do x = x + 1<br>   &lt;memory barrier&gt;<br>

<br>Versus emulated:<br><br>   &lt;memory barrier&gt;<br>   register int sl_index = int(&amp; x) &amp; 0xFF;<br>   CAS loop to set _spinlock_[sl_index] from 0 to 1<br>   x = x + 1 // no CAS needed here this time<br>   _spinlock_[sl_index] = 0 // no CAS needed to unlock<br>

   &lt;memory barrier&gt;<br>

<br>I assume some of these memory barriers are not needed, but is the second design really 100x slower?  I&#39;d think the CAS is the slowest part followed by the memory barrier, and the rest is fairly minor, right?  The sl_index calculation should be cheap since &amp;x must be in a register already.<br>
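<br>For anyone who wants to measure this rather than guess, here is a minimal C++ sketch of the two designs. It is an illustration under assumptions, not what any D compiler emits: double stands in for D&#39;s real so that the direct CAS is lock-free on common hardware, and the names add_one_cas, add_one_emulated, lock_index and g_spinlocks are hypothetical.<br>

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// "No magic": CAS loop directly on the value. This only works where the
// hardware has a compare-and-swap of the required width.
// Assumption: double stands in for D's `real`.
inline void add_one_cas(std::atomic<double>& x) {
    double old_val = x.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak reloads old_val, so we simply retry.
    while (!x.compare_exchange_weak(old_val, old_val + 1.0,
                                    std::memory_order_acq_rel,
                                    std::memory_order_relaxed)) {
    }
}

// "Emulated": a table of 256 spinlocks selected by low address bits,
// mirroring `int(&x) & 0xFF` from the pseudocode. Hypothetical names.
constexpr std::size_t kLockCount = 256;
static std::atomic<int> g_spinlocks[kLockCount] = {};

inline std::size_t lock_index(const void* addr) {
    return reinterpret_cast<std::uintptr_t>(addr) & 0xFF;
}

inline void add_one_emulated(double& x) {
    std::atomic<int>& lock = g_spinlocks[lock_index(&x)];
    int expected = 0;
    // CAS loop to take the lock (0 -> 1) with acquire ordering.
    while (!lock.compare_exchange_weak(expected, 1,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed)) {
        expected = 0;  // a failed CAS overwrote it with the observed value
    }
    x = x + 1.0;  // plain update, protected by the lock
    // Unlock with a plain release store; no CAS needed.
    lock.store(0, std::memory_order_release);
}
```

<br>Timing both functions in a tight loop, with and without contention from other threads, would show how far the emulated path really is from the direct CAS on a given machine. Every access to the variable has to go through the same lock for the emulated version to be correct.<br>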

<br>Kevin<br><br><br>