std.parallelism: Request for Review
dsimcha
dsimcha at yahoo.com
Sun Feb 27 11:09:17 PST 2011
I've looked into this more. I realized that I'm only able to reproduce
it when running Linux in a VM on top of Windows. When I reboot and run
my Linux distro on bare metal instead, I get decent (but not linear)
speedups on the matrix benchmark. I'm guessing this is due to things
like locking and context switches being less efficient/more expensive in
a VM than on bare metal. In your case, having two physical CPUs in
separate sockets probably makes the atomic ops required for locking,
context switches, etc. more expensive. From fiddling around, the GC
thing actually appears to be a non-issue.
Since only the inner loop, not the outer loop, is easily parallelizable,
I think a 256x256 matrix is really at the very edge of what's feasible
in terms of granularity. Each iteration of the outer loop takes only on
the order of half a millisecond in serial, which means we're trying to
parallelize an inner loop that takes only about half a CPU-millisecond
to run. (That's the cost of the whole inner loop, start to finish, not
of a single iteration.) Slight changes in the costs of various
primitives (or having more cores contending for locks, invoking context
switches, etc.) can have a huge effect. I've switched to a 1024x1024
matrix instead, although that seems to be somewhat memory
bandwidth-bound.
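
To make the structure concrete, here's a rough sketch of the shape of
the computation (this is not the benchmark's actual code; pivoting is
omitted and the names are made up for illustration). Only the inner
loop goes through std.parallelism, since the outer loop over pivots has
to stay serial:

import std.parallelism, std.range;

// Hypothetical Gauss-Jordan-style elimination, simplified (no pivoting).
void eliminate(float[][] m)
{
    immutable n = m.length;
    foreach (pivot; 0 .. n)            // serial: each step needs the last
    {
        // Each row update below is independent of the others, so only
        // this loop is run through the task pool.
        foreach (row; parallel(iota(n)))
        {
            if (row == pivot) continue;
            immutable factor = m[row][pivot] / m[pivot][pivot];
            foreach (col; 0 .. n)
                m[row][col] -= factor * m[pivot][col];
        }
    }
}

With only about half a millisecond of work inside the parallel region
per outer iteration, the queueing and synchronization overhead shows up
very easily.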
As a general statement, these benchmarks are much more fine-grained
than what I use std.parallelism for in the real world. I chose them
partly because fine-grained examples were the only simple,
non-domain-specific, dependency-free ones I could think of, and partly
to show that std.parallelism works reasonably well (though certainly
not perfectly) even with fairly fine-grained parallelism. The
unfortunate reality, though, is that this kind of micro-parallelism is
hard to implement efficiently and will probably always (in every lib,
not just mine) have performance characteristics that are highly
dependent on hardware, OS primitives, etc., and require some tuning.
This isn't to say that std.parallelism is the best micro-parallelism
lib out there, just that I highly doubt efficient general-case
micro-parallelism is a totally solved problem, or even practically
solvable, and these benchmarks illustrate a far-from-ideal case.
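
For what it's worth, the main tuning knob I mean here is the work unit
size that parallel() accepts, i.e. how many consecutive iterations each
task pulls from the queue at once. A toy illustration (the function and
the number 1000 are made up; the right value depends on the hardware and
on how cheap the loop body is):

import std.parallelism, std.range;

void scaleAll(float[] data, float k)
{
    // With a loop body this cheap, a large work unit size is needed to
    // amortize the queueing and synchronization overhead per task.
    foreach (i; parallel(iota(data.length), 1000))
        data[i] *= k;
}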
On 2/27/2011 1:44 PM, Russel Winder wrote:
> David,
>
> On Sun, 2011-02-27 at 09:48 -0500, dsimcha wrote:
> [ . . . ]
>> Can you please re-run the benchmark to make sure that this isn't just a
>> one-time anomaly? I can't seem to make the parallel matrix inversion
>> run slower than serial on my hardware, even with ridiculous tuning
>> parameters that I was almost sure would bottleneck the thing on the task
>> queue. Also, all the other benchmarks actually look pretty good.
>
> Sadly the result is consistent :-(
>
> |> matrixInversion
> Inverted a 256 x 256 matrix serially in 60 milliseconds.
> Inverted a 256 x 256 matrix using 4 cores in 76 milliseconds.
> 506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
> |> matrixInversion
> Inverted a 256 x 256 matrix serially in 58 milliseconds.
> Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
> 506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
> |> matrixInversion
> Inverted a 256 x 256 matrix serially in 61 milliseconds.
> Inverted a 256 x 256 matrix using 4 cores in 65 milliseconds.
> 506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
> |> matrixInversion
> Inverted a 256 x 256 matrix serially in 58 milliseconds.
> Inverted a 256 x 256 matrix using 4 cores in 76 milliseconds.
> 506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
> |> matrixInversion
> Inverted a 256 x 256 matrix serially in 59 milliseconds.
> Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
> 506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
> |> matrixInversion
> Inverted a 256 x 256 matrix serially in 58 milliseconds.
> Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
> 506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
> |> matrixInversion
> Inverted a 256 x 256 matrix serially in 58 milliseconds.
> Inverted a 256 x 256 matrix using 4 cores in 84 milliseconds.
> 506 anglides:/home/Checkouts/Git/Git/D_StdParallelism/benchmarks
> |>