std.parallelism: Request for Review
dsimcha
dsimcha at yahoo.com
Sun Feb 27 11:09:17 PST 2011
I've looked into this more. I realized that I'm only able to reproduce
it when running Linux in a VM on top of Windows. When I reboot and run
my Linux distro on bare metal instead, I get decent (but not linear)
speedups on the matrix benchmark. I'm guessing this is due to things
like locking and context switches being less efficient/more expensive in
a VM than on bare metal. In your case, having two physical CPUs in
separate sockets probably makes the atomic ops required for locking,
context switches, etc. more expensive. From fiddling around, the GC
thing actually appears to be a non-issue.
Since only the inner loop, not the outer loop, is easily parallelizable,
I think a 256x256 matrix is really at the very edge of what's feasible
in terms of granularity. Each iteration of the outer loop takes only on
the order of half a millisecond in serial, which means we're trying to
parallelize an inner loop that takes only about half a CPU-millisecond
to run. (That's the cost of the whole inner loop, start to finish, not
of a single iteration.) Slight changes in the costs of various
primitives (or having more cores contending for locks, invoking context
switches, etc.) can have a huge effect. I've switched to a 1024x1024
matrix instead, although that seems to be somewhat memory
bandwidth-bound.
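
To make the structure concrete, here's a rough sketch of the shape of
the computation (this is not the benchmark's actual code; pivoting is
omitted and the names are made up for illustration). Only the inner
loop goes through std.parallelism, since the outer loop over pivots has
to stay serial:

import std.parallelism, std.range;

// Hypothetical Gauss-Jordan-style elimination, simplified (no pivoting).
void eliminate(float[][] m)
{
    immutable n = m.length;
    foreach (pivot; 0 .. n)            // serial: each step needs the last
    {
        // Each row update below is independent of the others, so only
        // this loop is run through the task pool.
        foreach (row; parallel(iota(n)))
        {
            if (row == pivot) continue;
            immutable factor = m[row][pivot] / m[pivot][pivot];
            foreach (col; 0 .. n)
                m[row][col] -= factor * m[pivot][col];
        }
    }
}

With only about half a millisecond of work inside the parallel region
per outer iteration, the queueing and synchronization overhead shows up
very easily.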
As a general statement, these benchmarks are much more fine-grained
than what I use std.parallelism for in the real world. I chose them
partly because fine-grained examples were the only simple,
non-domain-specific, dependency-free ones I could think of, and partly
to show that std.parallelism works reasonably well (though certainly
not perfectly) even with fairly fine-grained parallelism. The
unfortunate reality, though, is that this kind of micro-parallelism is
hard to implement efficiently and will probably always (in every lib,
not just mine) have performance characteristics that are highly
dependent on hardware, OS primitives, etc., and require some tuning.
This isn't to say that std.parallelism is the best micro-parallelism
lib out there, just that I highly doubt efficient general-case
micro-parallelism is a totally solved problem, or even practically
solvable, and these benchmarks illustrate a far-from-ideal case.
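
For what it's worth, the main tuning knob I mean here is the work unit
size that parallel() accepts, i.e. how many consecutive iterations each
task pulls from the queue at once. A toy illustration (the function and
the number 1000 are made up; the right value depends on the hardware and
on how cheap the loop body is):

import std.parallelism, std.range;

void scaleAll(float[] data, float k)
{
    // With a loop body this cheap, a large work unit size is needed to
    // amortize the queueing and synchronization overhead per task.
    foreach (i; parallel(iota(data.length), 1000))
        data[i] *= k;
}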
On 2/27/2011 1:44 PM, Russel Winder wrote:
> David,
>
> On Sun, 2011-02-27 at 09:48 -0500, dsimcha wrote:
> [ . . . ]
>> Can you please re-run the benchmark to make sure that this isn't just a
>> one-time anomaly? I can't seem to make the parallel matrix inversion
>> run slower than serial on my hardware, even with ridiculous tuning
>> parameters that I was almost sure would bottleneck the thing on the task
>> queue. Also, all the other benchmarks actually look pretty good.
>
> Sadly the result is consistent :-(
>
> |> matrixInversion
> Inverted a 256 x 256 matrix serially in 60 milliseconds.
> Inverted a 256 x 256 matrix using 4 cores in 76 milliseconds.
> 506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
> |> matrixInversion
> Inverted a 256 x 256 matrix serially in 58 milliseconds.
> Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
> 506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
> |> matrixInversion
> Inverted a 256 x 256 matrix serially in 61 milliseconds.
> Inverted a 256 x 256 matrix using 4 cores in 65 milliseconds.
> 506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
> |> matrixInversion
> Inverted a 256 x 256 matrix serially in 58 milliseconds.
> Inverted a 256 x 256 matrix using 4 cores in 76 milliseconds.
> 506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
> |> matrixInversion
> Inverted a 256 x 256 matrix serially in 59 milliseconds.
> Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
> 506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
> |> matrixInversion
> Inverted a 256 x 256 matrix serially in 58 milliseconds.
> Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
> 506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
> |> matrixInversion
> Inverted a 256 x 256 matrix serially in 58 milliseconds.
> Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
> 506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
> |>