review of std.parallelism
dsimcha
dsimcha at yahoo.com
Sun Mar 20 08:42:04 PDT 2011
On 3/19/2011 2:14 PM, Michel Fortin wrote:
> On 2011-03-19 13:20:00 -0400, dsimcha <dsimcha at yahoo.com> said:
>
>> On 3/19/2011 1:09 PM, Michel Fortin wrote:
>>> For instance:
>>>
>>> void main() {
>>>     int sum = 0;
>>>     foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
>>>         sum += value;
>>>     }
>>>     writeln(sum);
>>> }
>>>
>>> The "+=" would need to be an atomic operation to avoid low-level races.
>>>
>>> I think that ParallelForeach's opApply should only accept a shared
>>> delegate. I define a shared delegate as a delegate that does not
>>> reference any non-shared variables of its outer scope. The problem is
>>> that DMD currently doesn't know how to determine whether a delegate
>>> literal is shared or not, so a delegate literal is never shared, and if
>>> ParallelForeach's opApply asked for a shared delegate as it should, it
>>> would just not work. Fix DMD to create shared delegate literals where
>>> appropriate and everything can be guaranteed race-free.
>>
>> If you want pedal-to-metal parallelism without insane levels of
>> verbosity, you can't have these safety features.
>
> I'm not sure where my proposal asks for more verbosity or less
> performance. All I can see is a few less casts in std.parallelism and
> that it'd disallow the case in my example above that is totally wrong.
> Either you're interpreting it wrong or there are things I haven't
> thought about (and I'd be happy to know about them).
>
> But by looking at all the examples in the documentation, I cannot find
> one that would need to be changed... well, except the one I'll discuss
> below.
>
Ok, I've had some time to think about this. The following example is
safe, but wouldn't work under your proposal, if I understand it correctly:

import std.math, std.parallelism;

void main() {
    auto logs = new double[1_000_000];
    foreach (i, ref elem; taskPool.parallel(logs)) {
        elem = log(i + 1.0);
    }
}
The problem is that you're writing to the same array from multiple
threads, which the compiler would interpret as writing to the same
variable. However, you can guarantee that no two threads ever write to
the same element, so it's safe.
Note: I'm aware of the false sharing issue when writing to adjacent
memory addresses. However, when writing to an array this big, it occurs
to a negligible extent. If you were using a smaller array, then either
the loop body would be so expensive that the cost of false sharing would
be negligible, or the whole loop would be too cheap to be worth
parallelizing.
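For what it's worth, the work unit size parameter of parallel() already
gives some control over how elements are divided among threads. A minimal
sketch (the work unit size of 10_000 is just an illustrative number, not
a tuned value):

import std.math, std.parallelism;

void main() {
    auto logs = new double[1_000_000];

    // Each worker grabs 10_000 consecutive elements at a time, so two
    // threads can only share a cache line at a boundary between chunks.
    foreach (i, ref elem; taskPool.parallel(logs, 10_000)) {
        elem = log(i + 1.0);
    }
}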
I'm also aware that word tearing is a concern on some architectures,
though not x86. IMHO std.parallelism and its documentation should not
be pedantic about portability to architectures that a D compiler doesn't
even exist for yet.
Also, your example can be trivially modified to be safe.
import std.parallelism, std.stdio;

void main() {
    int sum = 0;
    foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
        synchronized sum += value;
    }
    writeln(sum);
}
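Incidentally, for a primitive counter like this, an atomic update is a
lighter-weight alternative to the synchronized statement. A sketch using
core.atomic:

import core.atomic, std.parallelism, std.stdio;

void main() {
    shared int sum = 0;
    foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
        // Atomic read-modify-write instead of taking a lock.
        atomicOp!"+="(sum, value);
    }
    writeln(sum);
}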
In a toy example like this, where the loop body is nothing but the shared
update, that kills all parallelism, but in more realistic cases I use this
pattern often. I find it very common to have an expensive loop body that
can be performed in parallel except for a tiny portion that must update a
shared data structure. I'm aware that it might be possible, in theory, to
write this more formally using reduce() or something (see the sketch after
the list below). However:
1. If the portion of the loop that deals with shared data is very small
(so the serialization caused by the synchronized block is negligible),
it's often more efficient to keep a single data structure in memory and
update it concurrently than to use stronger isolation between threads,
as reduce() does, and maintain one data structure per thread.
2. In my experience, synchronizing on a small portion of the loop body
works very well in practice. My general philosophy is that, in a
library like this, dangerous but useful constructs must be supported and
treated as innocent until proven guilty, not the other way round.
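For reference, here's roughly what the reduce() version of the toy example
would look like, using std.parallelism's own taskPool.reduce (a sketch):

import std.parallelism, std.stdio;

void main() {
    // Each worker thread accumulates its own partial sum; the partial
    // sums are combined once at the end, so no locking is needed in the
    // hot path, at the cost of one accumulator per thread.
    auto values = [0, 2, 3, 6, 1, 4, 6, 3, 3, 3, 6];
    writeln(taskPool.reduce!"a + b"(values));
}

This is the stronger-isolation approach point 1 refers to: it trades the
single shared data structure for one per thread.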