std.parallel_algorithm

dsimcha dsimcha at yahoo.com
Mon May 23 04:40:26 PDT 2011


On 5/23/2011 6:10 AM, Russel Winder wrote:
> Jonathan,
>
> On Mon, 2011-05-23 at 02:15 -0700, Jonathan M Davis wrote:
>> On 2011-05-23 02:07, Russel Winder wrote:
>>> On Mon, 2011-05-23 at 08:16 +0200, Andrej Mitrovic wrote:
>>>> Russell, get parallelism.d from here:
>>>> https://github.com/dsimcha/phobos/tree/master/std
>>>
>>> Too many ls in there but I'll assume it's me ;-)
>>>
>>> So the implication is that std.parallelism has changed since the 2.053
>>> release and that std.parallel_algorithm relies on this change?
>>
>> That's what David said in his initial post. He had to make additional changes
>> to std.parallelism for std.parallel_algorithm to work.
>
> Ah, thanks for the reminder.  That'll teach me to skim read, get the
> URL, act, and fail to actually note important information.
>

Interesting.  Thanks.  Unfortunately, it doesn't look like merge and 
dotProduct scale to a quad core very well, because I tuned them to be 
very close to the break-even point on a dual core.  This is part of my 
concern:  in the worst case, stuff that's very close to break-even on a 
dual or quad core can get **slower** when run on a 6- or 8-core 
machine.  This is just a fact of life.  Eventually, I'd like to push 
the break-even point for std.parallelism a little lower, and I know 
roughly how to do it (mainly by implementing work stealing instead of a 
simple FIFO queue).  However, there are tons of nagging little 
implementation issues, and there would still always be a break-even 
point, even if it's lower.
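
Just to make the granularity issue concrete, here's a rough sketch of a 
chunked parallel dot product built only on the public std.parallelism 
API.  It is not the std.parallel_algorithm implementation, and the 
chunk size is a made-up number rather than a measured optimum:

import std.algorithm : map, min;
import std.numeric : dotProduct;
import std.parallelism : taskPool;
import std.range : iota;

// Chunked dot product:  each task handles one chunk, so each chunk's
// work has to stay well above the per-task overhead to be worthwhile.
double parallelDot(const(double)[] a, const(double)[] b)
{
    assert(a.length == b.length);
    enum chunkSize = 100_000;  // illustrative, not tuned

    // Lazily compute one partial dot product per chunk.
    auto partials = iota(0, a.length, chunkSize).map!(
        start => dotProduct(a[start .. min(start + chunkSize, a.length)],
                            b[start .. min(start + chunkSize, b.length)]));

    // Sum the per-chunk results across the task pool.
    return taskPool.reduce!"a + b"(0.0, partials);
}

Push chunkSize down toward the break-even point and the per-task 
queueing overhead stops being noise and eats the speedup.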

Right now the break-even point seems to be on the order of a few 
microseconds per task, which on modern hardware equates to ~10,000 
clock cycles.  This is good enough for a lot of practical purposes, but 
obviously not for nano-parallelism.  I've made improving this a fairly 
low priority.  Given the implementation difficulties, the fact that 
it's fairly easy to take full advantage of multicore hardware without 
getting down to this granularity (usually, if you need more 
performance, you're executing some expensive algorithm in a loop and 
can just parallelize the outer loop and be done with it), and the fact 
that there are plenty of other useful hacking jobs in the D ecosystem, 
the benefits-to-effort ratio doesn't seem that good.
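
For concreteness, here's a minimal sketch of the "parallelize the outer 
loop" case, again using only the public std.parallelism API.  
expensiveKernel is a made-up stand-in for whatever costly algorithm the 
loop body runs; the point is just that each outer iteration costs far 
more than ~10,000 cycles, so the per-task overhead is noise:

import std.math : sin;
import std.parallelism : taskPool;

// Made-up stand-in for an expensive per-iteration computation.
double expensiveKernel(size_t i)
{
    double sum = 0;
    foreach (j; 0 .. 1_000_000)
        sum += sin(i + j + 0.5);
    return sum;
}

void main()
{
    auto results = new double[1000];

    // Parallelize only the outer loop; the default work unit size is
    // fine because each iteration is expensive.
    foreach (i, ref r; taskPool.parallel(results))
        r = expensiveKernel(i);
}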

