David Simcha's std.parallelism

dsimcha dsimcha at yahoo.com
Sun Jan 9 08:51:42 PST 2011


On 1/1/2011 6:07 PM, Andrei Alexandrescu wrote:
> * parallel is templated on range, but not on operation. Does this affect
> speed for brief operations (such as the one given in the example,
> squares[i] = i * i)? I wonder if using an alias wouldn't be more
> appropriate. Some performance numbers would be very useful in any case.

Ok, I did the benchmarks.  Since map is templated on the operation, I 
used it as a proxy for the templated-on-operation scenario. 
Here's the benchmark:

import std.parallelism, std.stdio, std.datetime, std.range, std.conv,
     std.math, std.array;

int fun1(int num) {
     return roundTo!int(sqrt(cast(double) num));
}

int fun2(int num) {
     return num * num;
}

alias fun2 fun;

void main() {
     auto foo = array(iota(10_000_000));
     auto bar = new int[foo.length];

     enum workUnitSize = 1_000_000;

     auto sw = StopWatch(AutoStart.yes);
     foreach(i, elem; parallel(foo, workUnitSize)) {
         bar[i] = fun(elem);
     }
     writeln("Parallel Foreach:  ", sw.peek.milliseconds);

     sw = StopWatch(AutoStart.yes);
     bar = taskPool.map!fun(foo, workUnitSize, bar);
     writeln("Map:  ", sw.peek.milliseconds);

     sw = StopWatch(AutoStart.yes);
     foreach(i, elem; foo) {
         bar[i] = fun(elem);
     }
     writeln("Serial:  ", sw.peek.milliseconds);
}


Results:

Parallel Foreach:  69.2988
Map:  29.1973
Serial:  40.2884


So obviously there's a huge penalty when the loop body is super cheap.

On the other hand, when I make fun1 the loop body instead (and it's 
still a fairly cheap body), the differences are buried in noise.

Now that I've given my honest report of the facts, though, I'd like to 
say that even so, I'm in favor of leaving things as-is, for the 
following reasons:

1.  Super cheap loop bodies are usually not worth parallelizing anyhow. 
  You get nowhere near a linear speedup due to memory bandwidth issues, 
etc., and if some super cheap loop body is your main bottleneck, it's 
probably being executed in some outer loop, and it may make more sense 
to parallelize the outer loop instead.  In all my experience with 
std.parallelism, I've **never** had the need/desire to resort to 
parallelism fine-grained enough that the limitations of delegate-based 
parallel foreach mattered in practice.
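To illustrate the outer-loop point with a hypothetical sketch (not from 
the original benchmark): multiplying each matrix element is a super 
cheap body, but parallelizing over rows pays the per-work-unit overhead 
once per row rather than once per element:

```d
import std.parallelism;

void main() {
    auto matrix = new int[][](1_000, 1_000);

    // Parallelizing the cheap inner body directly would pay the task
    // overhead per element.  Parallelizing the outer loop instead gives
    // each work unit a whole row (1_000 elements) of work.
    foreach(i, row; parallel(matrix)) {
        foreach(j, ref elem; row) {
            elem = cast(int)(i * j);
        }
    }
}
```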

2.  If you really want to parallelize super cheap loop bodies, map() 
isn't going anywhere and that and/or reduce(), which also uses 
templates, will usually do what you need.  You can even use parallel map 
in place by simply passing in the same (writeable) range for both the 
input and the buffer.
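Here's a small sketch of that in-place idiom.  (Note: this uses the 
eager, buffered map discussed above; in later releases of 
std.parallelism that variant is spelled `amap`.)

```d
import std.parallelism;

int square(int x) { return x * x; }

void main() {
    auto nums = [1, 2, 3, 4, 5];

    // The same writable array serves as both the input range and the
    // result buffer, so the mapping happens in place, in parallel.
    taskPool.amap!square(nums, nums);

    assert(nums == [1, 4, 9, 16, 25]);
}
```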

3.  The foreach syntax makes possible the following very useful idioms 
(as in, I actually use them regularly) that wouldn't be possible if we 
used templates:

foreach(index, elem; parallel(range))
foreach(ref elem; parallel(range))

It also just plain looks nice.
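Fleshed out into a complete (hypothetical) example, both idioms look 
like this:

```d
import std.parallelism;

void main() {
    auto data = [1, 2, 3, 4];

    // Index + element: write results to a second array by index.
    auto squares = new int[data.length];
    foreach(i, elem; parallel(data)) {
        squares[i] = elem * elem;
    }

    // ref element: mutate the range in place.
    foreach(ref elem; parallel(data)) {
        elem *= 10;
    }
    assert(data == [10, 20, 30, 40]);
}
```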

4.  A major point of parallel foreach is that variables in the outer 
scope "just work".  When passing blocks of code as aliases instead of 
delegates, this is still very buggy.

5.  I'm hoping I can convince Walter to implement an alias-based version 
of opApply, which is half-implemented and commented out in the DMD 
source code.  If this were implemented, I'd change std.parallelism to 
use it and this whole discussion would be moot.

