review of std.parallelism

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Sat Mar 19 13:35:01 PDT 2011


On 03/19/2011 12:16 PM, dsimcha wrote:
> On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:
>> On 03/19/2011 02:32 AM, dsimcha wrote:
>>> Ok, thanks again for clarifying **how** the docs could be improved. I've
>>> implemented the suggestions and generally given the docs a good reading
>>> over and clean up. The new docs are at:
>>>
>>> http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html
>>
>> * Still no synopsis example that illustrates in a catchy way the most
>> attractive artifacts.
>
> I don't see what I could put here that isn't totally redundant with the
> rest of the documentation. Anything I could think of would basically
just involve concatenating all the examples. Furthermore, none of the
> other Phobos modules have this, so I don't know what one should look
> like.

I'm thinking along the lines of:

http://www.digitalmars.com/d/2.0/phobos/std_exception.html

A nice synopsis would be the pi computation. Just move that up to the 
synopsis. It's simple, clean, and easy to relate to. Generally, you'd 
put here not all details but the stuff you think would make it easiest 
for people to get into your library.
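Concretely, the synopsis could be little more than the pi computation already in the reviewed docs, along these lines (a sketch using the names from the std.parallelism revision under review, not the exact docs text):

```d
import std.algorithm, std.parallelism, std.range;

void main()
{
    immutable n = 1_000_000_000;
    immutable delta = 1.0 / n;

    real getTerm(int i)
    {
        immutable x = (i - 0.5) * delta;
        return delta / (1.0 + x * x);
    }

    // Sum the series terms across all cores, then scale:
    // pi == 4 * sum over i of delta / (1 + x_i^2).
    immutable pi = 4.0 * taskPool.reduce!"a + b"(map!getTerm(iota(n)));
}
```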

> In general I feel like std.parallelism is being held to a
> ridiculously high standard that none of the other Phobos modules
> currently meet.

I agree, but that goes without saying. This is not a double standard; 
it's a simple means to improve quality of Phobos overall. Clearly 
certain modules that are already in Phobos would not make it if proposed 
today. And that's a good thing! Comparing our current ambitions to the 
quality of the average Phobos module would not help us.
Generally it seems you have already grown tired of the exchange, and it 
would take only a few more rounds for you to say, "you know what, 
either let's get it over with or forget about it."

Please consider for a minute how this is the wrong tone and attitude to 
be having on several levels. This is not a debate and not the place to 
get defensive. Your role in the discussion is not symmetric with the 
others', and at best you'd use the review as an opportunity to improve 
your design, not dig in your heels to defend its current 
incarnation (within reason). Your potential customers are attempting to 
help you by asking questions (some of which no doubt are silly) and by 
making suggestions (some of which, again, are ill-founded). 
Nevertheless, we _are_ your potential customers and in a way the 
customer is always right. Your art is to steer customers from what they 
think they want to what you know they need - because you're the expert! 
- and to improve design, nomenclature, and implementation to the extent 
that would help them.

Furthermore, you should expect that the review process will prompt 
changes. My perception is that you consider the submission more or less 
final modulo possibly a few minor nits. You shouldn't. I'm convinced you 
know much more about SMP than most or all others in this group, but in 
no way does that mean your design has reached perfection and is beyond 
improvement even from a non-expert.

I know you'd have no problem finding the right voice in this discussion 
if you only frame it in the right light. Again, people are trying to 
help (however awkwardly) and in no way is that ridiculous.

>> * "After creation, Task objects are submitted to a TaskPool for
>> execution." I understand it's possible to use Task straight as a
>> promise/future, so s/are/may be/.
>
> No. The only way Task is useful is by submitting it to a pool to be
> executed. (Though this may change, see below.)

I very much hope this does change. Otherwise the role of Task in the 
design could be drastically reduced (e.g. nested type inside of 
TaskPool) without prejudice. At the minimum I want to be able to create 
a task, launch it, and check its result later without involving a pool. 
A pool is when I have many tasks that may exceed the number of CPUs etc. 
Simplicity would be great.

// start three reads
auto readFoo = task!readText("foo.txt");
auto readBar = task!readText("bar.txt");
auto readBaz = task!readText("baz.txt");
// join'em all
auto foo = readFoo.yieldWait();
auto bar = readBar.yieldWait();
auto baz = readBaz.yieldWait();

There's no need at this level for a task pool. What would be nice would 
be to have a join() that joins all tasks spawned by the current thread:

// start three reads
auto readFoo = task!readText("foo.txt");
auto readBar = task!readText("bar.txt");
auto readBaz = task!readText("baz.txt");
// join'em all
join();
// fetch results
auto foo = readFoo.spinWait();
auto bar = readBar.spinWait();
auto baz = readBaz.spinWait();

The way I see it is, task pools are an advanced means that coordinate m 
threads over n CPUs. If I don't care about that (as above) there should 
be no need for a pool at all. (Of course it's fine if used by the 
implementation.)

>> Also it is my understanding that
>> TaskPool efficiently adapts the concrete number of CPUs to an arbitrary
>> number of tasks. If that's the case, it's worth mentioning.
>
> Isn't this kind of obvious from the examples, etc.?

It wasn't to me. I had an intuition that that might be the case, but 
obvious - far from it. The fact that I need a task pool even when I 
don't care about m:n issues made me doubtful.

>> * "If a Task has been submitted to a TaskPool instance, is being stored
>> in a stack frame, and has not yet finished, the destructor for this
>> struct will automatically call yieldWait() so that the task can finish
>> and the stack frame can be destroyed safely." At this point in the doc
>> the reader doesn't understand that at all because TaskPool has not been
>> seen yet. The reader gets worried that she'll be essentially serializing
>> the entire process by mistake. Either move this explanation down or
>> provide an example.
>
> This is getting ridiculous. There are too many circular dependencies
> between Task and TaskPool that are impossible to remove here that I'm
> not even going to try. One or the other has to be introduced first, but
> neither can be explained without mentioning the other. This is why I
> explain the relationship briefly in the module level summary, so that
> the user has at least some idea. I think this is about the best I can do.

Perhaps cross-references could help (see macro XREF). As I mentioned, an 
example here could go a long way, too.

>> * The example that reads two files at the same time should NOT use
>> taskPool. It's just one task, why would the pool ever be needed? If you
>> also provided an example that reads n files in memory at the same time
>> using a pool, that would illustrate nicely why you need it. If a Task
>> can't be launched without being put in a pool, there should be a
>> possibility to do so. At my work we have a simple function called
>> callInNewThread that does what's needed to launch a function in a new
>> thread.
>
> I guess I could add something like this to Task. Under the hood it would
> (for implementation simplicity, to reuse a bunch of code from TaskPool)
> fire up a new single-thread pool, submit the task, call
> TaskPool.finish(), and then return. Since you're already creating a new
> thread, the extra overhead of creating a new TaskPool for the thread
> would be negligible and it would massively simplify the implementation.
> My only concern is that, when combined with scoped versus non-scoped
> tasks (which are definitely here to stay, see below) this small
> convenience function would add way more API complexity than it's worth.
> Besides, task pools don't need to be created explicitly anyhow. That's
> what the default instantiation is for. I don't see how callInNewThread
> would really solve much.

If we sit down for a second and think outside of the current design, the 
simplest Task use would be as such: "I start a task and get a sort of a 
future. I do something else. Then when I need the result of the future I 
call some method force() against it." Nowhere is the notion of a task 
pool to be found. To me this looks like an artifact of the 
implementation that has spilled into the API, which I understand is 
natural to you. But it's unnatural to me.

> Secondly, I think you're reading **WAY** too much into what was meant to
> be a simple example to illustrate usage mechanics. This is another case
> where I can't think of a small, cute example of where you'd really need
> the pool. There are plenty of larger examples, but the smallest/most
> self-contained one I can think of is a parallel sort. I decided to use
> file reading because it was good enough to illustrate the mechanics of
> usage, even if it didn't illustrate a particularly good use case.

It's impossible to not have a good small example. Sorting is great. You 
have the partition primitive already in std.algorithm, then off you go 
with tasks. Dot product on dense vectors is another good one. There's 
just plenty of operations that people understand are important to make fast.
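A dot product along those lines could be sketched as follows (hypothetical code, not from the reviewed docs; it assumes taskPool.reduce accepts a seed value and an arbitrary input range, per the reviewed module's conventions):

```d
import std.algorithm, std.parallelism, std.range;

// Hypothetical sketch: sum the element-wise products in parallel.
double dot(const(double)[] x, const(double)[] y)
{
    assert(x.length == y.length);
    return taskPool.reduce!"a + b"(
        0.0,
        map!((pair) { return pair[0] * pair[1]; })(zip(x, y)));
}
```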

So to me it looks like this:

* there are already some good examples of "I want 1-3 of tasks but don't 
care about pooling them": file reading and processing etc.

* there should be 1-2 good examples of partitioning a large chunk of work 
into many tasks and letting taskPool carry them.

With such, people would clearly understand the capabilities of the 
library and what they can achieve with various levels of involvement.

>> * The note below that example gets me thinking: it is an artificial
>> limitation to force users of Task to worry about scope and such. One
>> should be able to create a Future object (Task I think in your
>> terminology), pass it around like a normal value, and ultimately force
>> it. This is the case for all other languages that implement futures. I
>> suspect the "scope" parameter associated with the delegate a couple of
>> definitions below plays a role here, but I think we need to work for
>> providing the smoothest interface possible (possibly prompting
>> improvements in the language along the way).
>
> This is what TaskPool.task is for. Maybe this should be moved to the top
> of the definition of TaskPool and emphasized, and the scoped/stack
> allocated versions should be moved below TaskPool and de-emphasized?
>
> At any rate, though, anyone who's playing with parallelism should be an
> advanced enough programmer that concepts like scope and destructors are
> second nature.

I agree. That's not a reason to not find good names for entities. To me 
the relationship "global task() <=> scoped delegate; TaskPool method 
task() <=> closure" is inscrutable. I see no way of remembering it 
except rote memorization.

> We absolutely **need** scope delegates for calling stuff where a closure
> can't be allocated due to objects on the stack frame having scoped
> destruction. This is not going anywhere. Also, I **much** prefer to have
> everything that does conceptually the same thing have the same name. I
> think this is clearer, not less clear.

I agree in principle, but this case is just not that.

>> * As I mentioned, in the definition:
>>
>> Task!(run,TypeTuple!(F,Args)) task(F, Args...)(F fun, Args args);
>>
>> I can't tell what "run" is.
>
> run() is just the adapter function described right above this code. I
> cannot for the life of me understand how this could possibly be unclear.

OK.

>> * "A goto from inside the parallel foreach loop to a label outside the
>> loop will result in undefined behavior." Would this be a bug in dmd?
>
> No, it's because a goto of this form has no reasonable, useful
> semantics. I should probably mention in the docs that the same applies
> to labeled break and continue.
>
> I have no idea what semantics these should have, and even if I did,
> given the long odds that even one person would actually need them, I
> think they'd be more trouble than they're worth to implement. For
> example, once you break out of a parallel foreach loop to some arbitrary
> address (and different threads can goto different labels, etc.), well,
> it's no longer a parallel foreach loop. It's just a bunch of completely
> unstructured threading doing god-knows-what.
>
> Therefore, I slapped undefined behavior on it as a big sign that says,
> "Just don't do it." This also has the advantage that, if anyone ever
> thinks of any good, clearly useful semantics, these will be
> implementable without breaking code later.

Yah, I was actually thinking of disabling goto outside a local delegate 
everywhere.

>> * Again: speed of e.g. parallel min/max vs. serial, pi computation etc.
>> on a usual machine?
>
> I **STRONGLY** believe this does not belong in API documentation because
> it's too machine specific, compiler specific, stack alignment specific,
> etc. and almost any benchmark worth doing takes up more space than an
> example should. Furthermore, anyone who wants to know this can easily
> time it themselves. I have absolutely no intention of including this.
> While in general I appreciate and have tried to accommodate your
> suggestions, this is one I'll be standing firm on.

If scalability information were present, in however non-committal a 
form, people would be drawn in ("ok, so this shape of the loop would 
actually scale linearly with CPUs... neat").

Speaking of efficiency, I assume parallel foreach uses opApply with a 
delegate and the inherent overhead. So I'm thinking that a practical way 
to take advantage of parallel foreach would be to parallelize at some 
block level and then do a regular foreach inside the block?

foreach (i, ref elem; taskPool.parallel(logs)) {
     foreach (???)
         elem = log(i + 1.0);
}

How can I arrange things such that I compute e.g. 64 logs serially 
inside each pass?
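One way this could look, if the library exposes a work-unit-size knob on parallel() (I'm assuming such a parameter exists in the reviewed revision; treat it as a sketch):

```d
import std.math, std.parallelism;

void fillLogs(double[] logs)
{
    // Each work unit covers 64 consecutive elements, so the
    // per-delegate-call overhead is paid once per 64 iterations
    // rather than once per element.
    foreach (i, ref elem; taskPool.parallel(logs, 64))
    {
        elem = log(i + 1.0);
    }
}
```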

>> * The join example is fine, but the repetitive code suggests that loops
>> should be better:
>>
import std.file;

void main() {
    auto pool = new TaskPool();
    foreach (filename; ["foo.txt", "bar.txt", "baz.txt"]) {
        pool.put(task!read(filename));
    }

    // Call join() to guarantee that all tasks are done running, the worker
    // threads have terminated and that the results of all of the tasks can
    // be accessed without any synchronization primitives.
    pool.join();

    ubyte[][] data; // void[], meh
    // Use spinWait() since the results are guaranteed to have been computed
    // and spinWait() is the cheapest of the wait functions.
    foreach (task; pool.howDoIEnumerateTasks()) {
        data ~= task.spinWait();
    }
}
>
> You can't enumerate the tasks like this, but there's a good reason for
> this limitation. The design requires the pool not hold onto any
> references to the tasks after they're out of the queue, so that they can
> be safely destroyed as soon as the caller wants to destroy them. If you
> need to enumerate over them like this, then:
>
> 1. You're probably doing something wrong and parallel foreach would be a
> better choice.
>
> 2. If you insist on doing this, you need to explicitly store them in an
> array or something.
>
> As a more general comment, I know the join() example sucks. join() is
> actually a seldom-used primitive, not a frequently used one as you
> suggest. Most of the time you wait for individual tasks (explicitly or
> implicitly through the higher level functions), not the whole pool, to
> finish executing. I feel like it's obligatory that join() and finish()
> exist, but I never use them myself. They were much more useful in
> pre-release versions of this lib, when parallel foreach didn't exist and
> it made sense to have explicit arrays of Tasks. The difficulty I've had
> in coming up with a decent example for them makes me halfway tempted to
> just rip those $#)*% functions out entirely.


I'm confused here. I use join() pretty much all the time. I launch stuff 
(e.g. run a large matrix-vector multiplication for distinct vectors) and 
then I join that. Then again and again. The thread of execution has a 
repeated hourglass shape - it fans out and then becomes single-threaded 
at join points. I assume some of those computations are indeed the 
charter of parallel foreach, but I'd guess not all. But then what do I 
know - I'm the customer :o).


Andrei

