Poor memory allocation performance with a lot of threads on a 36-core machine
Witek via Digitalmars-d
digitalmars-d at puremagic.com
Thu Feb 18 05:00:12 PST 2016
Hi,
I was playing with std.parallelism, and implemented a parallel
NQueens code to count the number of solutions to the classical
NQueens problem. I do limited parallel recursion using a
taskPool.parallel foreach, and then switch at some level to a
serial algorithm. I use return values / atomic variables to count
the number of iterations / solutions, and then propagate that up
through return values and aggregate. (I might later switch to
taskPool.reduce with a custom opAdd on some struct, or use
taskPool.workerLocalStorage, which I think is an awesome concept.)
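To make the structure concrete, here is a minimal, self-contained
sketch of what the code roughly looks like. It is not my actual
implementation: the names attacks, solveSerial, solveParallel and
the PARALLEL_DEPTH cutoff are made up for illustration.

import std.parallelism : taskPool;
import core.atomic : atomicOp, atomicLoad;
import std.range : iota;
import std.array : array;
import std.stdio : writeln;

enum PARALLEL_DEPTH = 3;   // assumed cutoff; below it the serial version runs
shared ulong totalSolutions;

// true if a queen on `row` in the next column attacks an already placed queen
bool attacks(in int[] partialSolution, int row)
{
    foreach (col, r; partialSolution)
    {
        immutable dcol = cast(int)(partialSolution.length - col);
        if (r == row || r + dcol == row || r - dcol == row)
            return true;
    }
    return false;
}

// plain serial recursion; note that every call copies both arrays
ulong solveSerial(int[] partialSolution, int[] availableRows)
{
    if (availableRows.length == 0)
        return 1;
    ulong count = 0;
    foreach (i, row; availableRows)
    {
        if (attacks(partialSolution, row))
            continue;
        count += solveSerial(partialSolution ~ row,
                             availableRows[0 .. i] ~ availableRows[i + 1 .. $]);
    }
    return count;
}

// parallel foreach over the branches down to PARALLEL_DEPTH, then hand off to serial
void solveParallel(int[] partialSolution, int[] availableRows, int depth)
{
    if (depth >= PARALLEL_DEPTH)
    {
        atomicOp!"+="(totalSolutions, solveSerial(partialSolution, availableRows));
        return;
    }
    foreach (i, row; taskPool.parallel(availableRows))
    {
        if (attacks(partialSolution, row))
            continue;
        solveParallel(partialSolution ~ row,
                      availableRows[0 .. i] ~ availableRows[i + 1 .. $],
                      depth + 1);
    }
}

void main()
{
    enum N = 12;
    solveParallel([], iota(N).array, 0);
    writeln(atomicLoad(totalSolutions));
}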
Anyhow, everything was good on 1 or 2 threads, or maybe a few more,
on my laptop with an old dual-core CPU. I was able to speed it up
by a factor of exactly 2.
I wanted to try it out on a bigger machine, so I used an Amazon AWS
EC2 c4.8xlarge instance with 18 cores (36 hyperthreads/vCPUs), and
the results were pretty terrible.
I was hardly able to utilize more than 3% of each CPU, and the
program actually ran significantly slower, which was surprising,
since there was no synchronization, false sharing, or overly
fine-grained parallelism in the program.
strace -f was showing tons of futex operations, but I know that the
std.parallelism parallel foreach implementation doesn't use
synchronization in the main loop, just some atomic variable reads
in case another thread did a break or threw an exception. That
couldn't be the bottleneck.
I traced the problem to the array allocations in my implementation.
Because I only want to iterate over the remaining possible rows, I
keep int[] partialSolution and int[] availableRows, with the
invariant that their intersection is empty and their union contains
the integers 0 to N-1 (N = size of the problem). Because
partialSolution grows by 1 on each level of recursion, I just copy
it and append a single element, then pass it recursively (either to
a different thread or just a function call, depending on how deep
in the recursion we already are). It could be solved by using a
singly-linked list with a common tail shared by all partialSolutions
in different branches, but then the list would be reversed and I
would lose random access, which is a little bit annoying (though it
shouldn't hurt performance, as I am probably going to scan the
entire partialSolution array anyway). And I would still need to
allocate a new list node (which, granted, could be sped up using a
thread-local free list).
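For illustration, that shared-tail list alternative would look
roughly like this (names are made up); only one node is allocated
per level, and the common prefix is never copied:

struct Node
{
    int row;
    Node* tail;   // shared with every sibling branch that has the same prefix
}

// extend a partial solution by one row without touching the shared tail
Node* extend(Node* partial, int row)
{
    return new Node(row, partial);   // still one allocation per level
}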
An even bigger problem is availableRows, from which I need to
remove elements, which equates to allocating a new array and
copying all elements but one. This cannot be done with a list, and
a COW tree would be too expensive, and would still require some
allocations and possibly rebalancing, etc.
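Concretely, removing one row currently amounts to something like
this hypothetical helper: one fresh GC allocation plus a near-full
copy on every recursive call:

// remove the element at index i by allocating a new array
int[] without(int[] availableRows, size_t i)
{
    return availableRows[0 .. i] ~ availableRows[i + 1 .. $];
}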
I found that this is indeed the problem, because if I allocate
int[64] x = void; on the stack and then use a ScopeBuffer!int from
std.internal.scopebuffer, backed by x, as newAvailableRows (which
is safe, because I wait for the threads to finish before I exit the
scope, and the threads only read from these arrays before making
their own copies), I am able to run my nqueens implementation at
full system utilization (using 36 hyperthreads at 100% each, for a
speed-up of about 21x), and it is able to solve N=18 in 7 seconds
(compared to about 2.5 minutes), with the parallel part going down
to level 8 of the recursion (to improve load balancing).
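For reference, here is a simplified sketch of the fixed serial
part, not my exact code: attacks() is the same hypothetical
conflict check as in the earlier sketch, and the int[64] store
assumes N <= 64.

import std.internal.scopebuffer;

// serial recursion with a stack-backed buffer for newAvailableRows
ulong solveSerialNoAlloc(in int[] partialSolution, in int[] availableRows)
{
    if (availableRows.length == 0)
        return 1;

    int[64] store = void;          // stack storage, reused for each branch
    ulong count = 0;
    foreach (i, row; availableRows)
    {
        if (attacks(partialSolution, row))
            continue;

        auto rest = ScopeBuffer!int(store);
        scope(exit) rest.free();   // freed before leaving the scope, so nothing escapes
        rest.put(availableRows[0 .. i]);
        rest.put(availableRows[i + 1 .. $]);
        // the callee only reads rest[] before making its own copies
        count += solveSerialNoAlloc(partialSolution ~ row, rest[]);
    }
    return count;
}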
So, the question is: why is the D / DMD allocator so slow under
heavy multithreading? The working set is pretty small (a few
megabytes at most), so I do not think this is an issue with GC
scanning itself. Can I plug in tcmalloc / jemalloc to be used as
the underlying allocator instead of glibc? Or does the D runtime
use mmap/sbrk/etc. directly?
Thanks.