multithread/concurrency/parallel methods and performance
Dmitry Olshansky
dmitry.olsh at gmail.com
Tue Feb 20 05:43:40 UTC 2018
On Monday, 19 February 2018 at 14:57:22 UTC, SrMordred wrote:
> On Monday, 19 February 2018 at 05:54:53 UTC, Dmitry Olshansky
> wrote:
>> The operation is trivial and the dataset is rather small. In
>> such cases SIMD with e.g. array ops is the way to go:
>> result[] = values[] * values2[];
>
> Yes, absolutely right :)
>
> I made a simple example to understand why the threads are not
> scaling the way I thought they would.
Yeah, the world is an ugly place where trivial math sometimes
doesn’t work out.
I suggest you:
- run with different numbers of threads, from 1 to n
- vary sizes from 100k to 10m
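A minimal sketch of such a sweep (the sizes, thread counts and the
float payload are made up for illustration; it assumes
std.parallelism's TaskPool and std.datetime.stopwatch):

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.parallelism : TaskPool;
import std.stdio : writefln;

void main()
{
    foreach (size; [100_000, 1_000_000, 10_000_000])
    {
        auto values = new float[size];
        auto result = new float[size];
        values[] = 2.0f;

        foreach (nThreads; 1 .. 9)
        {
            // A fresh pool per configuration; finish() tears it down.
            auto pool = new TaskPool(nThreads);
            scope (exit) pool.finish();

            auto sw = StopWatch(AutoStart.yes);
            foreach (i, ref r; pool.parallel(result))
                r = values[i] * values[i];
            sw.stop();

            writefln("size=%10s threads=%s elapsed=%s",
                     size, nThreads, sw.peek);
        }
    }
}
```

Plotting elapsed time against thread count for each size should make
it obvious where the per-iteration work becomes too small to amortise
the scheduling and join overhead.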
For your numbers: 400ms / 64 is ~6ms, and if we divide by the
number of cores it’s 6/7 ~ 0.86ms, which is a good deal smaller
than a CPU timeslice.
In essence, a single core runs fast because it doesn’t have to
wait for all the others to complete via join, easily burning its
quota in one go.
In the MT case I bet some of the overhead comes from not all
threads finishing (and starting) at once, so the join blocks in
the kernel.
You could run your MT code under strace (e.g. something like
strace -f -e trace=futex ./yourprog, with ./yourprog standing in
for your binary) to see if it hits the futex call or some such;
if it does, that’s where you are getting the delays. (That’s
assuming you are on Linux.)
The std.parallelism version is a bit faster because, I think, it
caches the created thread pool, so you don’t start threads anew
on each run.
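For comparison, here is a sketch of why that matters (the workload
and function names are made up): std.parallelism's default taskPool
spins up its workers lazily on first use and then reuses them, while
hand-rolled core.thread Threads are created and joined anew on every
call.

```d
import core.thread : Thread;
import std.parallelism : taskPool;

void squareByHand(float[] data)
{
    // Hand-rolled: four fresh threads are created, started and
    // joined on every call, so each run pays thread start-up cost.
    enum n = 4;
    Thread[n] threads;
    immutable chunk = data.length / n;
    foreach (t; 0 .. n)
    {
        immutable lo = t * chunk;
        immutable hi = (t == n - 1) ? data.length : lo + chunk;
        threads[t] = new Thread({
            foreach (i; lo .. hi)
                data[i] = data[i] * data[i];
        }).start();
    }
    foreach (t; threads)
        t.join();
}

void squareWithPool(float[] data)
{
    // The default taskPool creates its worker threads once, on
    // first use; subsequent calls reuse the same threads.
    foreach (ref x; taskPool.parallel(data))
        x = x * x;
}
```

If you call squareByHand in a loop you pay thread creation on every
iteration; squareWithPool only pays it once, which likely accounts
for part of the difference you measured.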
> I imagine that, if one core’s work is done in 200ms, a 4-core
> run will be done in 50ms plus some overhead, since they are
> working on separate blocks of memory, without need of sync and
> without false sharing, etc. (at least I think I don’t have that
> problem here).
If you had a long queue of small tasks like that, and you didn’t
wait to join all threads until absolutely required, you’d get
near-perfect scalability (unless you hit other bottlenecks, like
RAM bandwidth).
More information about the Digitalmars-d-learn mailing list