review of std.parallelism
Michel Fortin
michel.fortin at michelf.com
Sun Mar 20 19:44:40 PDT 2011
On 2011-03-20 11:42:04 -0400, dsimcha <dsimcha at yahoo.com> said:
> On 3/19/2011 2:14 PM, Michel Fortin wrote:
>> On 2011-03-19 13:20:00 -0400, dsimcha <dsimcha at yahoo.com> said:
>>
>>> On 3/19/2011 1:09 PM, Michel Fortin wrote:
>>>> For instance:
>>>>
>>>> import std.parallelism, std.stdio;
>>>>
>>>> void main() {
>>>>     int sum = 0;
>>>>     foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
>>>>         sum += value;
>>>>     }
>>>>     writeln(sum);
>>>> }
>>>>
>>>> The "+=" would need to be an atomic operation to avoid low-level races.
>>>>
>>>> I think that ParallelForeach's opApply should only accept a shared
>>>> delegate. I define a shared delegate as a delegate that does not reference
>>>> any non-shared variables of its outer scope. The problem is that DMD
>>>> currently doesn't know how to determine whether a delegate literal is
>>>> shared or not, so a delegate literal is never shared; if
>>>> ParallelForeach's opApply asked for a shared delegate, as it should, it
>>>> would just not work. Fix DMD to create shared delegate literals where
>>>> appropriate and everything can be guaranteed race-free.
>>>
>>> If you want pedal-to-metal parallelism without insane levels of
>>> verbosity, you can't have these safety features.
>>
>> I'm not sure where my proposal asks for more verbosity or less
>> performance. All I can see is a few fewer casts in std.parallelism and
>> that it'd disallow the case in my example above, which is totally wrong.
>> Either you're interpreting it wrong or there are things I haven't
>> thought about (and I'd be happy to know about them).
>>
>> But by looking at all the examples in the documentation, I cannot find
>> one that would need to be changed... well, except the one I'll discuss
>> below.
>>
>
> Ok, I've had some time to think about this. The following example is
> safe, but wouldn't work if I understand your proposal correctly:
>
> auto logs = new double[1_000_000];
>
> foreach (i, ref elem; taskPool.parallel(logs)) {
>     elem = log(i + 1.0);
> }
>
> The problem is that you're writing to the same array from multiple
> threads, which the compiler would interpret as writing to the same
> variable. However, you can guarantee that no two threads ever write to
> the same element, so it's safe.
I don't see a problem with the above. The array elements you modify are
passed through parallel's opApply, which can easily check whether it's
safe to pass them by ref to different threads (by checking the element's
size) and allow or disallow the operation accordingly.
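To illustrate, here's a minimal sketch of that check (the ParallelByRef
wrapper and its name are mine for illustration, not std.parallelism's
actual implementation):

import std.stdio;

// Sketch only: refuse by-ref iteration over elements smaller than a
// machine word, since concurrent writes to two sub-word elements
// sharing the same word could tear each other on some architectures.
struct ParallelByRef(T)
{
    T[] data;

    int opApply(int delegate(ref T) dg)
    {
        // Word-sized (or larger) elements occupy distinct words, so
        // writes to distinct elements from different threads cannot
        // interfere with each other.
        static assert(T.sizeof >= size_t.sizeof,
            "by-ref parallel iteration over sub-word elements is unsafe");
        foreach (ref e; data)              // serial stand-in for the
            if (auto r = dg(e)) return r;  // real work-splitting logic
        return 0;
    }
}

void main()
{
    auto a = new double[8];
    foreach (ref e; ParallelByRef!double(a))
        e = 1.0;
    writeln(a); // ParallelByRef!byte would fail the static assert
}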
It could even do a clever trick to make it safe to pass things such as
elements of an array of bytes by ref (by coalescing the loop iterations
for all bytes sharing the same word into one task). That might not work
for ranges which are not arrays, however.
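Here's a rough sketch of that coalescing idea (the chunking scheme is
purely illustrative; std.parallelism does nothing of the sort today):
hand out word-aligned chunks instead of individual bytes, so all the
bytes sharing a given machine word are written by a single task:

import std.parallelism, std.range, std.stdio;

void main()
{
    auto bytes = new ubyte[1000];
    enum wordSize = size_t.sizeof;
    immutable nChunks = (bytes.length + wordSize - 1) / wordSize;

    // One parallel iteration per word-sized chunk; the bytes inside a
    // chunk are written serially by whichever task got that chunk, so
    // no machine word is ever written to by two tasks.
    foreach (chunk; taskPool.parallel(iota(nChunks)))
    {
        immutable lo = chunk * wordSize;
        immutable hi = lo + wordSize < bytes.length ? lo + wordSize
                                                    : bytes.length;
        foreach (i; lo .. hi)
            bytes[i] = cast(ubyte)(i & 0xFF);
    }
    writeln(bytes[0 .. 16]);
}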
That said, feel free to suggest more problematic examples.
> Note: I'm aware of the false sharing issue when writing to adjacent
> memory addresses. However, when writing to an array this big it occurs
> to a negligible extent. If you were using a smaller array, then either
> the loop body would be so expensive that the cost of false sharing
> would be negligible, or the whole loop would be too cheap to be worth
> parallelizing.
>
> I'm also aware that word tearing is a concern on some architectures,
> though not x86. IMHO std.parallelism and its documentation should not
> be pedantic about portability to architectures that a D compiler
> doesn't even exist for yet.
No question there.
> Also, your example can be trivially modified to be safe.
>
> import std.parallelism, std.stdio;
>
> void main() {
>     int sum = 0;
>     foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
>         synchronized sum += value;
>     }
>     writeln(sum);
> }
>
> In this case that kills all parallelism, but in more realistic cases I
> use this pattern often. I find it very common to have an expensive
> loop body that can be performed in parallel, except for a tiny portion
> that must update a shared data structure. I'm aware that it might be
> possible, in theory, to write this more formally using reduce() or
> something. However:
>
> 1. If the portion of the loop that deals with shared data is very
> small (and therefore the serialization caused by the synchronized block
> is negligible), it's often more efficient to only keep one data
> structure in memory and update it concurrently, rather than use
> stronger isolation between threads like reduce() does, and have to
> maintain one data structure for each thread.
>
> 2. In my experience synchronizing on a small portion of the loop body
> works very well in practice. My general philosophy is that, in a
> library like this, dangerous but useful constructs must be supported
> and treated as innocent until proven guilty, not the other way round.
Your second example is not really a good justification of anything.
I'll refer you to how synchronized classes work. It was decided that
synchronized in a class protects everything that is directly stored in
the class. Anything behind an indirection is considered shared by the
compiler. The implication of this is that if you have an array or a
pointer to something that you want semantically to be protected by the
class's mutex, you have to cast things to unshared. It was decided that
things should be safe against low-level races first, and convenience
was relegated to a secondary concern. I don't like it very much, but
that's what was decided and written in TDPL.
Now we're in the exact same situation (except that no classes are
involved) and you're proposing that, for this case, we put convenience
ahead of low-level race safety? To me, going that route would be an
utter lack of coherence in the design of D's "synchronized" feature.
For the case above, wouldn't it be better to use an atomic add instead
of a synchronized block? In this case you can make "sum" shared and
have the type system check that everything is safe (using my proposed
rules). And if your data structure is bigger, it'll likely be a
synchronized class and you won't have to resort to bypassing type-system
safety checks inside your loop (although you might have to inside your
synchronized class, but that's another matter).
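Concretely, here's your summation example rewritten with an atomic add;
a minimal sketch of what I'm suggesting (core.atomic's atomicOp performs
the read-modify-write without a lock):

import core.atomic, std.parallelism, std.stdio;

void main()
{
    shared int sum = 0;
    foreach (value; taskPool.parallel([0, 2, 3, 6, 1, 4, 6, 3, 3, 3, 6]))
    {
        // Lock-free read-modify-write on the shared accumulator:
        // no synchronized block, and no low-level race.
        atomicOp!"+="(sum, value);
    }
    writeln(sum);
}

Under my proposed rules the loop body would qualify as a shared
delegate, since "sum" is shared and nothing non-shared is captured, so
the type system could verify this.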
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/