review of std.parallelism

Michel Fortin michel.fortin at michelf.com
Sun Mar 20 19:44:40 PDT 2011


On 2011-03-20 11:42:04 -0400, dsimcha <dsimcha at yahoo.com> said:

> On 3/19/2011 2:14 PM, Michel Fortin wrote:
>> On 2011-03-19 13:20:00 -0400, dsimcha <dsimcha at yahoo.com> said:
>> 
>>> On 3/19/2011 1:09 PM, Michel Fortin wrote:
>>>> For instance:
>>>> 
>>>> void main() {
>>>>     int sum = 0;
>>>>     foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
>>>>         sum += value;
>>>>     }
>>>>     writeln(sum);
>>>> }
>>>> 
>>>> The "+=" would need to be an atomic operation to avoid low-level races.
>>>> 
>>>> I think that ParallelForeach's opApply should only accept a shared
>>>> delegate. I define a shared delegate as a delegate that does not
>>>> reference any non-shared variables of its outer scope. The problem is
>>>> that DMD currently doesn't know how to determine whether a delegate
>>>> literal is shared or not, so a delegate literal is never shared; if
>>>> ParallelForeach's opApply required a shared delegate, as it should, it
>>>> would simply not work. Fix DMD to create shared delegate literals where
>>>> appropriate and everything can be guaranteed race-free.
>>> 
>>> If you want pedal-to-metal parallelism without insane levels of
>>> verbosity, you can't have these safety features.
>> 
>> I'm not sure where my proposal asks for more verbosity or less
>> performance. All I can see is a few fewer casts in std.parallelism, and
>> that it'd disallow the case in my example above, which is totally wrong.
>> Either you're interpreting it wrong or there are things I haven't
>> thought about (and I'd be happy to know about them).
>> 
>> But by looking at all the examples in the documentation, I cannot find
>> one that would need to be changed... well, except the one I'll discuss
>> below.
>> 
> 
> Ok, I've had some time to think about this.  The following example is 
> safe, but wouldn't work if I understand correctly what you're proposing:
> 
> auto logs = new double[1_000_000];
> 
> foreach(i, ref elem; taskPool.parallel(logs)) {
>      elem = log(i + 1.0);
> }
> 
> The problem is that you're writing to the same array from multiple 
> threads, which the compiler would interpret as writing to the same 
> variable.  However, you can guarantee that no two threads ever write to 
> the same element, so it's safe.

I don't see a problem with the above. The array elements you modify are 
passed through parallel's opApply, which can easily check whether it's 
safe to pass them by ref to different threads (by checking the 
element's size) and allow or disallow the operation accordingly.

It could even do a clever trick to make it safe to pass things such as 
elements of an array of bytes by ref (by coalescing the loop iterations 
for all bytes sharing the same word into one task). That might not work 
for ranges which are not arrays, however.

That said, feel free to suggest more problematic examples.
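
To make the coalescing idea concrete, here is one way it could look for 
arrays, sketched with std.range.chunks. This is illustrative only, not 
std.parallelism's actual implementation; the chunk size and the use of 
chunks() are my assumptions:

    import std.parallelism : taskPool;
    import std.range : chunks;
    import std.stdio : writeln;

    void main() {
        auto data = new ubyte[64];
        // Group word-sized runs of adjacent bytes into a single unit of
        // work, so no two tasks ever write into the same machine word
        // and word tearing cannot occur.
        foreach (chunk; taskPool.parallel(chunks(data, size_t.sizeof))) {
            foreach (ref b; chunk)
                b = 1;
        }
        writeln(data[0]);
    }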


> Note:  I'm aware of the false sharing issue when writing to adjacent 
> memory addresses.  However, when writing to an array this big it occurs 
> to a negligible extent.  If you were using a smaller array, then either 
> the loop body would be so expensive that the cost of false sharing 
> would be negligible, or the whole loop would be too cheap to be worth 
> parallelizing.
> 
> I'm also aware that word tearing is a concern on some architectures, 
> though not x86.  IMHO std.parallelism and its documentation should not 
> be pedantic about portability to architectures that a D compiler 
> doesn't even exist for yet.

No question there.


> Also, your example can be trivially modified to be safe.
> 
> void main() {
>     int sum = 0;
>     foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
>         synchronized sum += value;
>     }
>     writeln(sum);
> }
> 
> In this case that kills all parallelism, but in more realistic cases I 
> use this pattern often.  I find it very common to have an expensive 
> loop body that can be performed in parallel, except for a tiny portion 
> that must update a shared data structure.  I'm aware that it might be 
> possible, in theory, to write this more formally using reduce() or 
> something.  However:
> 
> 1.  If the portion of the loop that deals with shared data is very 
> small (and therefore the serialization caused by the synchronized block 
> is negligible), it's often more efficient to only keep one data 
> structure in memory and update it concurrently, rather than use 
> stronger isolation between threads like reduce() does, and have to 
> maintain one data structure for each thread.
> 
> 2.  In my experience synchronizing on a small portion of the loop body 
> works very well in practice.  My general philosophy is that, in a 
> library like this, dangerous but useful constructs must be supported 
> and treated as innocent until proven guilty, not the other way round.

Your second example is not really a good justification of anything. 
I'll refer you to how synchronized classes work. It was decided that 
synchronized in a class protects everything that is directly stored in 
the class. Anything behind an indirection is considered shared by the 
compiler. The implication of this is that if you have an array or a 
pointer to something that you want semantically to be protected by the 
class's mutex, you have to cast things to unshared. It was decided that 
things should be safe against low-level races first, and convenience 
was relegated to a secondary concern. I don't like it very much, but 
that's what was decided and written in TDPL.
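
The rule described above can be sketched as follows. The Counter class 
is hypothetical, and whether a given DMD version accepts this exactly 
as written is uncertain; it only illustrates the TDPL semantics:

    // A synchronized class's mutex protects only fields stored directly
    // in the object; data behind an indirection stays typed shared.
    synchronized class Counter {
        private int direct;      // directly stored: guarded by the mutex
        private int[] counts;    // contents sit behind an indirection

        void bump(size_t i) {
            ++direct;                    // OK: protected by the lock
            // ++counts[i];              // rejected: elements are shared(int)
            ++(cast(int[]) counts)[i];   // must cast away shared by hand
        }
    }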

Now we're in the exact same situation (except that no classes are 
involved), and you're proposing that for this case we make convenience 
take precedence over low-level race safety? To me, going that route 
would be an utter lack of coherence in the design of D's "synchronized" 
feature.

For the case above, wouldn't it be better to use an atomic add instead 
of a synchronized block? In this case you can make "sum" shared and 
have the type system check that everything is safe (using my proposed 
rules). And if your data structure is bigger, it'll likely be a 
synchronized class, and you won't have to resort to bypassing the type 
system's safety checks inside your loop (although you might have to 
inside your synchronized class, but that's another matter).
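
For completeness, here is a minimal sketch of that atomic-add variant, 
using atomicOp from core.atomic (the exact import paths are as in 
current druntime/Phobos):

    import core.atomic : atomicOp;
    import std.parallelism : taskPool;
    import std.stdio : writeln;

    void main() {
        shared int sum = 0;  // shared, so the type system tracks it
        foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
            atomicOp!"+="(sum, value); // atomic add: no low-level race
        }
        writeln(sum); // 37
    }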

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/
