multithread/concurrency/parallel methods and performance

Dmitry Olshansky dmitry.olsh at gmail.com
Tue Feb 20 05:43:40 UTC 2018


On Monday, 19 February 2018 at 14:57:22 UTC, SrMordred wrote:
> On Monday, 19 February 2018 at 05:54:53 UTC, Dmitry Olshansky 
> wrote:
>> The operation is trivial and the dataset is rather small. In 
>> such cases SIMD with e.g. array ops is the way to go:
>> result[] = values[] * values2[];
>
> Yes, absolutely right :)
>
> I made a simple example to understand why the threads are not 
> scaling the way I thought they would.

Yeah, the world is an ugly place where trivial math sometimes 
doesn’t work.

I suggest that you:
- run with a different number of threads, from 1 to N
- vary the sizes from 100k to 10M (see the sketch below)
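A minimal sketch of such a sweep using std.parallelism (the array 
names and the multiply kernel are placeholders for your actual 
benchmark):

    import std.datetime.stopwatch, std.parallelism, std.stdio;

    void main()
    {
        foreach (size; [100_000, 1_000_000, 10_000_000])
        {
            auto a = new float[size];
            auto b = new float[size];
            auto r = new float[size];
            foreach (nThreads; 1 .. totalCPUs + 1)
            {
                // nThreads - 1 workers; the current thread
                // also participates in the parallel foreach
                auto pool = new TaskPool(nThreads - 1);
                auto sw = StopWatch(AutoStart.yes);
                foreach (i, ref x; pool.parallel(r))
                    x = a[i] * b[i];
                sw.stop();
                pool.finish(true);
                writefln("size=%s threads=%s time=%s ms",
                         size, nThreads, sw.peek.total!"msecs");
            }
        }
    }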

For your numbers: 400ms / 64 is ~6ms; if we divide by the number 
of cores, that’s 6/7 ≈ 0.86ms, which is a good deal smaller than 
a CPU timeslice.

In essence, a single core runs fast because it doesn’t wait for 
all the others to complete via join, easily burning its quota in 
one go. In the MT case I bet some of the overhead comes from not 
all threads finishing (and starting) at once, so the join blocks 
in the kernel.

You could run your MT code under strace to see whether it hits 
the futex call or some such; if it does, that’s where you are 
getting the delays. (That’s assuming you are on Linux.)
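Something along these lines should work (./app stands in for your 
compiled benchmark):

    strace -f -e trace=futex ./app

-f follows the spawned threads, and -e trace=futex limits the 
output to futex syscalls, so blocked joins show up directly.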

The std.parallelism version is a bit faster because, I believe, 
it caches the created thread pool, so you don’t start threads 
anew on each run.
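For reference, a minimal sketch of that reuse pattern: 
std.parallelism’s global taskPool spawns its workers lazily on 
first use and keeps them alive, so only the first run pays the 
thread-startup cost (the multiply loop is just a stand-in):

    import std.parallelism;

    void main()
    {
        auto a = new float[1_000_000];
        auto b = new float[1_000_000];
        auto r = new float[1_000_000];

        foreach (run; 0 .. 10)
        {
            // taskPool's worker threads are created on first
            // use and reused, so later runs skip thread startup
            foreach (i, ref x; taskPool.parallel(r))
                x = a[i] * b[i];
        }
    }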

> I imagine that, if one core’s work is done in 200ms, a 4-core 
> run will be done in 50ms plus some overhead, since they are 
> working on separate blocks of memory, without the need to sync, 
> without false sharing, etc. (at least I think I don’t have that 
> problem here).

If you had a long queue of small tasks like that, and you didn’t 
wait to join all the threads until absolutely required, you’d get 
near-perfect scalability (unless you hit other bottlenecks, like 
RAM bandwidth).
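A rough sketch of that pattern with std.parallelism tasks (the 
chunk size and the work function are made up for illustration):

    import std.parallelism;

    void work(float[] chunk)
    {
        foreach (ref x; chunk)
            x *= x;
    }

    void main()
    {
        auto data = new float[8_000_000];
        enum chunkSize = 100_000;

        // enqueue many small independent tasks; workers pull
        // them continuously instead of stopping at a barrier
        Task!(work, float[])*[] tasks;
        for (size_t i = 0; i < data.length; i += chunkSize)
        {
            auto t = task!work(data[i .. i + chunkSize]);
            taskPool.put(t);
            tasks ~= t;
        }

        // join only once, when the results are actually needed
        foreach (t; tasks)
            t.yieldForce();
    }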
