review of std.parallelism

dsimcha dsimcha at yahoo.com
Sat Mar 19 19:16:32 PDT 2011


On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:
> On 03/19/2011 12:16 PM, dsimcha wrote:
>> On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:
>>> On 03/19/2011 02:32 AM, dsimcha wrote:
>>>> Ok, thanks again for clarifying **how** the docs could be improved.
>>>> I've
>>>> implemented the suggestions and generally given the docs a good reading
>>>> over and clean up. The new docs are at:
>>>>
>>>> http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html
>>>
>>> * Still no synopsis example that illustrates in a catchy way the most
>>> attractive artifacts.
>>
>> I don't see what I could put here that isn't totally redundant with the
>> rest of the documentation. Anything I could think of would basically
>> just involve concatenating all the examples. Furthermore, none of the
>> other Phobos modules have this, so I don't know what one should look
>> like.
>
> I'm thinking along the lines of:
>
> http://www.digitalmars.com/d/2.0/phobos/std_exception.html
>
> A nice synopsis would be the pi computation. Just move that up to the
> synopsis. It's simple, clean, and easy to relate to. Generally, you'd
> put here not all details but the stuff you think would make it easiest
> for people to get into your library.

Good example, will do.

>>> * "After creation, Task objects are submitted to a TaskPool for
>>> execution." I understand it's possible to use Task straight as a
>>> promise/future, so s/are/may be/.
>>
>> No. The only way Task is useful is by submitting it to a pool to be
>> executed. (Though this may change, see below.)
>
> I very much hope this does change. Otherwise the role of Task in the
> design could be drastically reduced (e.g. nested type inside of
> TaskPool) without prejudice. At the minimum I want to be able to create
> a task, launch it, and check its result later without involving a pool.
> A pool is when I have many tasks that may exceed the number of CPUs etc.
> Simplicity would be great.
>
> // start three reads
> auto readFoo = task!readText("foo.txt");
> auto readBar = task!readText("bar.txt");
> auto readBaz = task!readText("baz.txt");
> // join'em all
> auto foo = readFoo.yieldWait();
> auto bar = readBar.yieldWait();
> auto baz = readBaz.yieldWait();

This is definitely feasible in principle.  I'd like to implement it, but 
there are a few annoying, hairy details standing in the way.  For reasons 
I detailed previously, we need both scoped and non-scoped tasks.  We 
also have alias vs. callable (i.e. function pointer or delegate) tasks. 
Now we're adding pool vs. new-thread tasks.  This is turning into a 
combinatorial explosion and needs to be simplified somehow.  I propose 
the following:

1.  I've reconsidered and actually like the idea of task() vs. 
scopedTask().  task() returns a pointer on the heap.  scopedTask() 
returns a struct on the stack.  Neither would be a member function of 
TaskPool.

2.  Non-scoped Task pointers would need to be explicitly submitted to 
the task pool via the put() method.  This means getting rid of 
TaskPool.task().

3.  The Task struct would grow a function runInNewThread() or something 
similar.  (If you think this would be a common case, even just execute() 
might cut it.)

The workflow would now be that you call task() to get a heap-allocated 
Task*, or scopedTask() to get a stack-allocated Task.  You then call 
either TaskPool.put() to execute it on a pool or Task.runInNewThread() 
to run it in a new thread.  The creation of the Task is completely 
orthogonal to how it's run.
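
To make that concrete, usage would look something like the sketch below. 
Note that scopedTask(), put() taking a Task*, and runInNewThread() are 
all proposals from this message, not the current API:

import std.file, std.parallelism;

void main() {
     // Heap-allocated: a Task* you can store, pass around, and
     // explicitly submit to the shared pool.
     auto readFoo = task!readText("foo.txt");
     taskPool.put(readFoo);

     // Stack-allocated: cheaper, but must not outlive this scope.
     auto readBar = scopedTask!readText("bar.txt");
     readBar.runInNewThread();  // proposed: dedicated thread, no pool

     auto foo = readFoo.yieldWait();
     auto bar = readBar.yieldWait();
}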

>
> There's no need at this level for a task pool. What would be nice would
> be to have a join() that joins all tasks spawned by the current thread:
>
> // start three reads
> auto readFoo = task!readText("foo.txt");
> auto readBar = task!readText("bar.txt");
> auto readBaz = task!readText("baz.txt");
> // join'em all
> join();
> // fetch results
> auto foo = readFoo.spinWait();
> auto bar = readBar.spinWait();
> auto baz = readBaz.spinWait();
>

I don't understand how this would be a substantial improvement over the 
first example, where you just call yieldWait() on all three. 
Furthermore, implementing join() as shown in this example would require 
some kind of central registry of all tasks/worker threads/task 
pools/something similar, which would be a huge PITA to implement 
efficiently.

>
>> Secondly, I think you're reading **WAY** too much into what was meant to
>> be a simple example to illustrate usage mechanics. This is another case
>> where I can't think of a small, cute example of where you'd really need
>> the pool. There are plenty of larger examples, but the smallest/most
>> self-contained one I can think of is a parallel sort. I decided to use
>> file reading because it was good enough to illustrate the mechanics of
>> usage, even if it didn't illustrate a particularly good use case.
>
> It's impossible to not have a good small example. Sorting is great. You
> have the partition primitive already in std.algorithm, then off you go
> with tasks. Dot product on dense vectors is another good one. There's
> just plenty of operations that people understand are important to make
> fast.

I forgot about std.algorithm.partition.  This makes a parallel quicksort 
so trivial to implement (ignoring the issue of selecting a good pivot, 
which I think can be safely ignored in example code) that it might 
actually make a good example.
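
Roughly along these lines (a sketch only, using the free-standing 
task()/put()/yieldWait() names discussed above; the naive middle-element 
pivot means a degenerate input could recurse forever, which real code 
would have to guard against):

import std.algorithm, std.parallelism;

void parallelSort(T)(T[] data, size_t grainSize = 1024) {
     if(data.length <= grainSize) {
         sort(data);  // Small subarrays:  plain serial sort.
         return;
     }

     immutable pivot = data[data.length / 2];  // Naive pivot choice.

     // Everything < pivot moves to the front; partition returns the
     // remaining (>= pivot) subrange.
     auto right = partition!((a) { return a < pivot; })(data);
     auto left = data[0 .. data.length - right.length];

     // Sort the left half in a task, the right half in this thread.
     auto t = task!(parallelSort!T)(left, grainSize);
     taskPool.put(t);
     parallelSort(right, grainSize);
     t.yieldWait();
}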


>
>>> * "A goto from inside the parallel foreach loop to a label outside the
>>> loop will result in undefined behavior." Would this be a bug in dmd?
>>
>> No, it's because a goto of this form has no reasonable, useful
>> semantics. I should probably mention in the docs that the same applies
>> to labeled break and continue.
>>
>> I have no idea what semantics these should have, and even if I did,
>> given the long odds that even one person would actually need them, I
>> think they'd be more trouble than they're worth to implement. For
>> example, once you break out of a parallel foreach loop to some arbitrary
>> address (and different threads can goto different labels, etc.), well,
>> it's no longer a parallel foreach loop. It's just a bunch of completely
>> unstructured threading doing god-knows-what.
>>
>> Therefore, I slapped undefined behavior on it as a big sign that says,
>> "Just don't do it." This also has the advantage that, if anyone ever
>> thinks of any good, clearly useful semantics, these will be
>> implementable without breaking code later.
>
> Yah, I was actually thinking of disabling goto outside a local delegate
> everywhere.

See the discussion I'm having with Michael Fortin.  My latest idea is 
that break, labeled break/continue, return, and goto should all throw 
exceptions when found inside a parallel foreach loop.  They affect the 
flow of subsequent iterations, and "subsequent iterations" only makes 
sense when the loop is being executed in serial.

>
>>> * Again: speed of e.g. parallel min/max vs. serial, pi computation etc.
>>> on a usual machine?
>>
>> I **STRONGLY** believe this does not belong in API documentation because
>> it's too machine specific, compiler specific, stack alignment specific,
>> etc. and almost any benchmark worth doing takes up more space than an
>> example should. Furthermore, anyone who wants to know this can easily
>> time it themselves. I have absolutely no intention of including this.
>> While in general I appreciate and have tried to accommodate your
>> suggestions, this is one I'll be standing firm on.
>
> If scalability information is present in however a non-committal form,
> then people would be compelled ("ok, so this shape of the loop would
> actually scale linearly with CPUs... neat").

Ok, I thought you were asking for something much more rigorous than 
this.  I therefore didn't want to provide it because I figured that, no 
matter what I did, someone would be able to say that the benchmark is 
flawed somehow, yada, yada, yada.  Given how inexact a science 
benchmarking is, I'm still hesitant to put such results in API docs, but 
I can see where you're coming from here.

>
> Speaking of efficiency, I assume parallel foreach uses opApply with a
> delegate and the inherent overhead. So I'm thinking that a practical way
> to take advantage of parallel foreach would be to parallelize at some
> block level and then do a regular foreach inside the block?
>
> foreach (i, ref elem; taskPool.parallel(logs)) {
> foreach (???)
> elem = log(i + 1.0);
> }
>
> How can I arrange things such that I compute e.g. 64 logs serially
> inside each pass?
>

Three things here:

1.  For this kind of nano-parallelism, map() might be better suited.  To 
fill an existing array, just use map's explicit buffer feature.

taskPool.map!log(iota(1, logs.length + 1), logs);

The option of using map() for nano-parallelism is part of my rationale 
for keeping the pretty but mildly inefficient delegate call in parallel 
foreach.

2.  You're severely overestimating the overhead of the delegate call.
log() is a pretty cheap function and even so speedups are decent with 
parallel foreach compared to a regular foreach loop.

import std.stdio, std.parallelism, std.datetime, std.math;


void main() {
     auto logs = new float[10_000_000];
     taskPool();  // Initialize TaskPool before timing.

     auto sw = StopWatch(AutoStart.yes);
     foreach(i, ref elem; logs) {
         elem = log(i + 1);
     }
     writeln("Serial:  ", sw.peek.msecs);

     sw.reset();
     foreach(i, ref elem; parallel(logs)) {
         elem = log(i + 1);
     }
     writeln("Parallel Foreach:  ", sw.peek.msecs);
}


Results:

Serial:  619
Parallel Foreach:  388

I'd include parallel map, too, but for some reason (probably stack 
alignment or something) including it changes the results for parallel 
foreach.

3.  If you really want to do as you suggest, just make a chunks range or 
something (maybe this should be in std.range) that iterates over all 
non-overlapping size N slices of a sliceable range.  Use this for the 
parallel loop, then loop over the individual elements of the slice inside.

> I'm confused here. I use join() pretty much all the time. I launch stuff
> (e.g. run a large matrix-vector multiplication for distinct vectors) and
> then I join that. Then again and again. The thread of execution has a
> repeated hourglass shape - it fans out and then becomes single-threaded
> at join points. I assume some of those computations are indeed the
> charter of parallel foreach, but I'd guess not all. But then what do I
> know - I'm the customer :o).

Yes, parallel foreach is idiomatic here, or maybe parallel map.

In the pre-release versions of std.parallelism (my experimentation with 
parallelism libraries in D goes back to mid-2008, over a year before I 
released the first version as Parallelfuture), only the task primitive 
existed.  I discovered that, in practice, creating Task objects in a 
loop is almost always a PITA.  It can also be inefficient in that, in a 
naive implementation, all of these exist in memory simultaneously when 
this might be completely unnecessary.  I decided to create higher level 
primitives to handle all the use cases I could think of where you might 
otherwise want to create Task objects in a loop.  If you're explicitly 
creating Task objects in a loop in std.parallelism, I can just about 
guarantee that there's an easier and at least equally efficient way to 
accomplish what you want to accomplish.  If there are any important use 
cases I missed, I'd be much more interested in creating a few more 
high-level primitives rather than making it easier to work with Task 
objects created in a loop.
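
For instance, the Task-per-element pattern collapses into a one-liner 
with the explicit buffer feature of map() shown earlier (sketch):

import std.math, std.parallelism;

void main() {
     auto nums = [1.0, 4.0, 9.0, 16.0];
     auto roots = new double[nums.length];

     // A loop creating one task!sqrt(nums[i]) per element would keep
     // nums.length Task objects alive at once and make you wait on
     // each by hand.  map() does the same work with a fixed number of
     // worker threads, writing results into the supplied buffer.
     taskPool.map!sqrt(nums, roots);
}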


More information about the Digitalmars-d mailing list