review of std.parallelism

dsimcha dsimcha at yahoo.com
Sat Mar 19 10:16:02 PDT 2011


On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:
> On 03/19/2011 02:32 AM, dsimcha wrote:
>> Ok, thanks again for clarifying **how** the docs could be improved. I've
>> implemented the suggestions and generally given the docs a thorough
>> read-over and clean-up.  The new docs are at:
>>
>> http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html
>
> * Still no synopsis example that illustrates in a catchy way the most
> attractive artifacts.

I don't see what I could put here that isn't totally redundant with the 
rest of the documentation.  Anything I could think of would basically 
just involve concatenating all the examples.  Furthermore, none of the 
other Phobos modules have this, so I don't know what one should look 
like.  In general I feel like std.parallelism is being held to a 
ridiculously high standard that none of the other Phobos modules 
currently meet.

>
> * "After creation, Task objects are submitted to a TaskPool for
> execution." I understand it's possible to use Task straight as a
> promise/future, so s/are/may be/.

No.  The only way Task is useful is by submitting it to a pool to be 
executed.  (Though this may change, see below.)
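
For concreteness, the basic lifecycle looks roughly like this (just a 
sketch using the names from the current docs; read is from std.file):

    import std.file, std.parallelism;

    void main() {
        // Create a Task wrapping a call to read("foo.txt")...
        auto t = task!read("foo.txt");

        // ...submit it to the default pool for execution...
        taskPool.put(t);

        // ...and block until the result is available.
        auto contents = t.yieldWait();
    }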

> Also it is my understanding that
> TaskPool efficiently adapts the concrete number of CPUs to an arbitrary
> number of tasks. If that's the case, it's worth mentioning.

Isn't this kind of obvious from the examples, etc.?

> * "If a Task has been submitted to a TaskPool instance, is being stored
> in a stack frame, and has not yet finished, the destructor for this
> struct will automatically call yieldWait() so that the task can finish
> and the stack frame can be destroyed safely." At this point in the doc
> the reader doesn't understand that at all because TaskPool has not been
> seen yet. The reader gets worried that she'll be essentially serializing
> the entire process by mistake. Either move this explanation down or
> provide an example.

This is getting ridiculous.  There are circular dependencies between 
Task and TaskPool in the documentation that are impossible to remove, so 
I'm not even going to try.  One or the other has to be introduced first, 
but neither can be explained without mentioning the other.  This is why I 
explain the relationship briefly in the module-level summary, so that 
the user has at least some idea.  I think this is about the best I can do.

>
> * Is done() a property?

Yes.  DDoc sucks.

>
> * The example that reads two files at the same time should NOT use
> taskPool. It's just one task, why would the pool ever be needed? If you
> also provided an example that reads n files in memory at the same time
> using a pool, that would illustrate nicely why you need it. If a Task
> can't be launched without being put in a pool, there should be a
> possibility to do so. At my work we have a simple function called
> callInNewThread that does what's needed to launch a function in a new
> thread.

I guess I could add something like this to Task.  Under the hood it 
would (for implementation simplicity, to reuse a bunch of code from 
TaskPool) fire up a new single-thread pool, submit the task, call 
TaskPool.finish(), and then return.  Since you're already creating a new 
thread, the extra overhead of creating a new TaskPool for the thread 
would be negligible, and it would massively simplify the implementation. 
My only concern is that, when combined with scoped versus non-scoped 
tasks (which are definitely here to stay, see below), this small 
convenience function would add way more API complexity than it's worth. 
Besides, task pools don't need to be created explicitly anyhow; that's 
what the default instantiation is for.  I don't see how callInNewThread 
would really solve much.
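
Roughly, what I have in mind would amount to something like this (just a 
sketch reusing the primitives already discussed; the name 
runTaskInNewThread is made up for illustration, and I'm assuming 
TaskPool's constructor takes a worker-thread count):

    // Hypothetical convenience function, not part of the current API.
    auto runTaskInNewThread(alias fun, Args...)(Args args) {
        // Fire up a pool with exactly one worker thread...
        auto pool = new TaskPool(1);

        // ...heap-allocate the task and submit it to that pool...
        auto t = pool.task!fun(args);

        // ...and tell the pool to shut its worker down once the queue drains.
        pool.finish();

        // The caller waits on the returned task with yieldWait(), etc.
        return t;
    }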

Secondly, I think you're reading **WAY** too much into what was meant to 
be a simple example to illustrate usage mechanics.  This is another case 
where I can't think of a small, cute example of where you'd really need 
the pool.  There are plenty of larger examples, but the smallest/most 
self-contained one I can think of is a parallel sort.  I decided to use 
file reading because it was good enough to illustrate the mechanics of 
usage, even if it didn't illustrate a particularly good use case.

>
> * The note below that example gets me thinking: it is an artificial
> limitation to force users of Task to worry about scope and such. One
> should be able to create a Future object (Task I think in your
> terminology), pass it around like a normal value, and ultimately force
> it. This is the case for all other languages that implement futures. I
> suspect the "scope" parameter associated with the delegate a couple of
> definitions below plays a role here, but I think we need to work toward
> providing the smoothest interface possible (possibly prompting
> improvements in the language along the way).

This is what TaskPool.task is for.  Maybe this should be moved to the 
top of the definition of TaskPool and emphasized, and the scoped/stack 
allocated versions should be moved below TaskPool and de-emphasized?

At any rate, though, anyone who's playing with parallelism should be an 
advanced enough programmer that concepts like scope and destructors are 
second nature.
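
To illustrate the distinction, something along these lines (a sketch 
based on the names in the current docs; slowSquare and startSquare are 
just stand-ins):

    import std.parallelism;

    // Stand-in for real work.
    int slowSquare(int x) {
        return x * x;
    }

    // TaskPool.task() heap-allocates and submits automatically, so the
    // returned handle can safely escape the function that created it.
    // The stack-allocated free-function task() could not be returned
    // like this.
    auto startSquare(int x) {
        return taskPool.task!slowSquare(x);
    }

    void main() {
        auto fut = startSquare(7);
        auto answer = fut.yieldWait();  // force the future
    }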

>
> * I'm not sure how to interpret the docs for
>
> ReturnType!(F) run(F, Args...)(F fpOrDelegate, ref Args args);
>
> So it's documented but I'm not supposed to care. Why not just remove?
> Surely there must be higher-level examples that clarify that I can use
> delegates etc.

Yes, and there are such examples.  It's just that I want the 
documentation to state explicitly what type is returned here rather than 
just showing auto.  If I omitted the docs for run(), you'd ask what 
run() is.

>
> * "If you want to escape the Task object from the function in which it
> was created or prefer to heap allocate and automatically submit to the
> pool, see TaskPool.task()." I'm uncomfortable that I need to remember a
> subtle distinction of the same primitive name ("task") depending on
> membership in a TaskPool or not, which is a tenuous relationship to
> remember. How about "scopedTask()" vs. "task()"? Also, it's worth asking
> ourselves what's the relative overhead of closure allocation vs. task
> switching etc. Maybe we get to simplify things a lot at a small cost in
> efficiency.

We absolutely **need** scope delegates for calling stuff where a closure 
can't be allocated due to objects on the stack frame having scoped 
destruction.  This is not going anywhere.  Also, I **much** prefer to 
have everything that does conceptually the same thing have the same 
name.  I think this is clearer, not less clear.
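
A typical case where the scope delegate form matters (a sketch; the 
closure-allocation rules here are the compiler's, not this module's):

    import std.parallelism;
    import std.stdio;

    struct Resource {
        // A local of a type with a destructor has scoped destruction, so
        // the compiler refuses to move its stack frame into a heap closure.
        ~this() { writeln("resource released"); }
        int value = 42;
    }

    void main() {
        Resource r;                  // destroyed at the end of main()

        int readIt() { return r.value; }

        auto t = task(&readIt);      // stack-allocated Task; the delegate
                                     // parameter is scope, so no closure
                                     // allocation is needed
        taskPool.put(t);
        auto v = t.yieldWait();      // forced before r is destroyed
    }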

>
> * As I mentioned, in the definition:
>
> Task!(run,TypeTuple!(F,Args)) task(F, Args...)(F fun, Args args);
>
> I can't tell what "run" is.

run() is just the adapter function described right above this code.  I 
cannot for the life of me understand how this could possibly be unclear.

> * "A goto from inside the parallel foreach loop to a label outside the
> loop will result in undefined behavior." Would this be a bug in dmd?

No, it's because a goto of this form has no reasonable, useful 
semantics.  I should probably mention in the docs that the same applies 
to labeled break and continue.

I have no idea what semantics these should have, and even if I did, 
given the long odds that even one person would actually need them, I 
think they'd be more trouble than they're worth to implement.  For 
example, once you break out of a parallel foreach loop to some arbitrary 
address (and different threads can goto different labels, etc.), well, 
it's no longer a parallel foreach loop.  It's just a bunch of completely 
unstructured threading doing god-knows-what.

Therefore, I slapped undefined behavior on it as a big sign that says, 
"Just don't do it."  This also has the advantage that, if anyone ever 
thinks of any good, clearly useful semantics, these will be 
implementable without breaking code later.
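
For example, the following is exactly the kind of thing the docs are 
warning against (a sketch; I'm assuming taskPool.parallel as the parallel 
foreach entry point):

    import std.parallelism;

    void main() {
        auto nums = new int[1000];

        // DON'T do this: breaking (or goto'ing) out of the parallel loop
        // body while other worker threads may still be executing
        // iterations is the undefined behavior described above.
        outer:
        foreach (i, ref x; taskPool.parallel(nums)) {
            if (i == 500)
                break outer;    // undefined behavior
            x = cast(int) i;
        }
    }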
>
> * Again: speed of e.g. parallel min/max vs. serial, pi computation etc.
> on a usual machine?

I **STRONGLY** believe this does not belong in API documentation because 
it's too machine-specific, compiler-specific, stack-alignment-specific, 
etc., and almost any benchmark worth doing takes up more space than an 
example should.  Furthermore, anyone who wants to know this can easily 
time it themselves.  I have absolutely no intention of including this. 
While in general I appreciate and have tried to accommodate your 
suggestions, this is one I'll be standing firm on.

>
> * The join example is fine, but the repetitive code suggests that loops
> should be better:
>
> import std.file;
>
> void main() {
>     auto pool = new TaskPool();
>     foreach (filename; ["foo.txt", "bar.txt", "baz.txt"]) {
>         pool.put(task!read(filename));
>     }
>
>     // Call join() to guarantee that all tasks are done running, the worker
>     // threads have terminated, and that the results of all of the tasks can
>     // be accessed without any synchronization primitives.
>     pool.join();
>
>     ubyte[][] data; // void[], meh
>     // Use spinWait() since the results are guaranteed to have been computed
>     // and spinWait() is the cheapest of the wait functions.
>     foreach (task; pool.howDoIEnumerateTasks()) {
>         data ~= task.spinWait();
>     }
> }

You can't enumerate the tasks like this, but there's a good reason for 
this limitation.  The design requires that the pool not hold onto any 
references to tasks after they're out of the queue, so that they can be 
safely destroyed as soon as the caller wants to destroy them.  If you 
need to enumerate them like this, then:

1.  You're probably doing something wrong and parallel foreach would be 
a better choice.

2.  If you insist on doing this, you need to explicitly store the tasks 
in an array or similar container yourself (see the sketch below).
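
Something along these lines would work (a sketch using TaskPool.task(), 
which heap-allocates and submits; the element type is captured with 
typeof rather than spelled out):

    import std.file, std.parallelism;

    void main() {
        auto pool = new TaskPool();
        auto names = ["foo.txt", "bar.txt", "baz.txt"];

        // The pool keeps no references to submitted tasks, so keep our own.
        typeof(pool.task!read(names[0]))[] tasks;
        foreach (name; names) {
            // TaskPool.task() heap-allocates and submits in one step, so
            // the handle can safely outlive this loop iteration.
            tasks ~= pool.task!read(name);
        }

        void[][] data;
        foreach (t; tasks) {
            data ~= t.yieldWait();  // wait for each task individually
        }
    }

Most of the time, though, parallel foreach makes this pattern unnecessary.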

As a more general comment, I know the join() example sucks.  join() is 
actually a seldom-used primitive, not a frequently used one as you 
suggest.  Most of the time you wait for individual tasks (explicitly or 
implicitly through the higher level functions), not the whole pool, to 
finish executing.  I feel like it's obligatory that join() and finish() 
exist, but I never use them myself.  They were much more useful in 
pre-release versions of this lib, when parallel foreach didn't exist and 
it made sense to have explicit arrays of Tasks.  The difficulty I've had 
in coming up with a decent example for them makes me halfway tempted to 
just rip those $#)*% functions out entirely.

