David Simcha's std.parallelism
dsimcha
dsimcha at yahoo.com
Sun Jan 9 08:51:42 PST 2011
On 1/1/2011 6:07 PM, Andrei Alexandrescu wrote:
> * parallel is templated on range, but not on operation. Does this affect
> speed for brief operations (such as the one given in the example,
> squares[i] = i * i)? I wonder if using an alias wouldn't be more
> appropriate. Some performance numbers would be very useful in any case.
Ok, I did the benchmarks. Since map is templated on the operation, I
used that as a benchmark of the templating on operation scenario.
Here's the benchmark:
import std.parallelism, std.stdio, std.datetime, std.range, std.conv,
std.math;
int fun1(int num) {
return roundTo!int(sqrt(num));
}
int fun2(int num) {
return num * num;
}
alias fun2 fun;
void main() {
auto foo = array(iota(10_000_000));
auto bar = new int[foo.length];
enum workUnitSize = 1_000_000;
auto sw = StopWatch(autoStart);
foreach(i, elem; parallel(foo, workUnitSize)) {
bar[i] = fun(elem);
}
writeln("Parallel Foreach: ", sw.peek.milliseconds);
sw = StopWatch(autoStart);
bar = taskPool.map!fun(foo, workUnitSize, bar);
writeln("Map: ", sw.peek.milliseconds);
sw = StopWatch(autoStart);
foreach(i, elem; foo) {
bar[i] = fun(elem);
}
writeln("Serial: ", sw.peek.milliseconds);
}
Results:
Parallel Foreach: 69.2988
Map: 29.1973
Serial: 40.2884
So obviously there's a huge penalty when the loop body is super cheap.
On the other hand, when I make fun1 the loop body instead (and it's
still a fairly cheap body), the differences are buried in noise.
Now that I've given my honest report of the facts, though, I'd like to
say that even so, I'm in favor of leaving things as-is, for the
following reasons:
1. Super cheap loop bodies are usually not worth parallelizing anyhow.
You get nowhere near a linear speedup due to memory bandwidth issues,
etc., and if some super cheap loop body is your main bottleneck, it's
probably being executed in in some outer loop and it may make more sense
to parallelize the outer loop. In all my experience with
std.parallelism, I've **never** had the the need/desire to resort to
parallelism fine grained enough that the limitations of delegate-based
parallel foreach mattered in practice.
2. If you really want to parallelize super cheap loop bodies, map()
isn't going anywhere and that and/or reduce(), which also uses
templates, will usually do what you need. You can even use parallel map
in place by simply passing in the same (writeable) range for both the
input and the buffer.
3. The foreach syntax makes the following very useful things (as in, I
actually use them regularly) possible that wouldn't be possible if we
used templates:
foreach(index, elem; parallel(range))
foreach(ref elem; parallel(range))
It also just plain looks nice.
4. A major point of parallel foreach is that variables in the outer
scope "just work". When passing blocks of code as aliases instead of
delegates, this is still very buggy.
5. I'm hoping I can convince Walter to implement an alias-based version
of opApply, which is half-implemented and commented out in the DMD
source code. If this were implemented, I'd change std.parallelism to
use it and this whole discussion would be moot.
More information about the Digitalmars-d
mailing list