Poor memory allocation performance with a lot of threads on a 36-core machine
Witek via Digitalmars-d
digitalmars-d at puremagic.com
Thu Feb 18 05:00:12 PST 2016
Hi,
I was playing with std.parallelism, and implemented a parallel
NQueens code to count the number of solutions to the classical
NQueens problem. I do limited parallel recursion using a
taskPool.parallel foreach, and then switch at some level to a
serial algorithm. I use return values / atomic variables to count
the number of iterations / solutions, and then propagate that up
through return values and aggregate. (I might later switch to
taskPool.reduce with a custom opAdd on some struct, or use
taskPool.workerLocalStorage, which I think is an awesome concept.)
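To make the structure concrete, here is a minimal, self-contained
sketch of what the code roughly looks like. It is not my actual
implementation: the names attacks, solveSerial, solveParallel and
the PARALLEL_DEPTH cutoff are made up for illustration.

import std.parallelism : taskPool;
import core.atomic : atomicOp, atomicLoad;
import std.range : iota;
import std.array : array;
import std.stdio : writeln;

enum PARALLEL_DEPTH = 3;   // assumed cutoff; below it the serial version runs
shared ulong totalSolutions;

// true if a queen on `row` in the next column attacks an already placed queen
bool attacks(in int[] partialSolution, int row)
{
    foreach (col, r; partialSolution)
    {
        immutable dcol = cast(int)(partialSolution.length - col);
        if (r == row || r + dcol == row || r - dcol == row)
            return true;
    }
    return false;
}

// plain serial recursion; note that every call copies both arrays
ulong solveSerial(int[] partialSolution, int[] availableRows)
{
    if (availableRows.length == 0)
        return 1;
    ulong count = 0;
    foreach (i, row; availableRows)
    {
        if (attacks(partialSolution, row))
            continue;
        count += solveSerial(partialSolution ~ row,
                             availableRows[0 .. i] ~ availableRows[i + 1 .. $]);
    }
    return count;
}

// parallel foreach over the branches down to PARALLEL_DEPTH, then hand off to serial
void solveParallel(int[] partialSolution, int[] availableRows, int depth)
{
    if (depth >= PARALLEL_DEPTH)
    {
        atomicOp!"+="(totalSolutions, solveSerial(partialSolution, availableRows));
        return;
    }
    foreach (i, row; taskPool.parallel(availableRows))
    {
        if (attacks(partialSolution, row))
            continue;
        solveParallel(partialSolution ~ row,
                      availableRows[0 .. i] ~ availableRows[i + 1 .. $],
                      depth + 1);
    }
}

void main()
{
    enum N = 12;
    solveParallel([], iota(N).array, 0);
    writeln(atomicLoad(totalSolutions));
}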
Anyhow, everything was good on 1 or 2 threads, or maybe a few more,
on my laptop with an old dual-core CPU. I was able to speed it up
by a factor of exactly 2.
I wanted to try it out on a bigger machine, so I used an Amazon AWS
EC2 c4.8xlarge instance with 18 cores (36 hyperthreads/vCPUs), and
the results were pretty terrible.
I was hardly able to utilize more than 3% of each CPU, and the
program actually ran significantly slower, which was surprising,
since there was no synchronization, false sharing, or overly
fine-grained parallelism in the program.
strace -f was showing tons of futex operations, but I know that the
std.parallelism parallel foreach implementation doesn't use
synchronization in the main loop, just some atomic variable reads
in case another thread did a break or threw an exception. That
couldn't be the bottleneck.
I traced the problem to the array allocations in my implementation.
Because I only want to iterate over the remaining possible rows, I
keep int[] partialSolution and int[] availableRows, with the
invariant that their intersection is empty and their union contains
the integers 0 to N-1 (N = size of the problem). Because
partialSolution grows by 1 on each level of recursion, I just copy
it and append a single element, then pass it recursively (either to
a different thread or just a function call, depending on how deep
in the recursion we already are). It could be solved by using a
singly-linked list with a common tail shared by all partialSolutions
in different branches, but then the list would be reversed and I
would lose random access, which is a little bit annoying (though it
shouldn't hurt performance, as I am probably going to scan the
entire partialSolution array anyway). And I would still need to
allocate a new list node (which, granted, could be sped up using a
thread-local free list).
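For illustration, that shared-tail list alternative would look
roughly like this (names are made up); only one node is allocated
per level, and the common prefix is never copied:

struct Node
{
    int row;
    Node* tail;   // shared with every sibling branch that has the same prefix
}

// extend a partial solution by one row without touching the shared tail
Node* extend(Node* partial, int row)
{
    return new Node(row, partial);   // still one allocation per level
}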
An even bigger problem is availableRows, from which I need to
remove elements, which equates to allocating a new array and
copying all elements but one. This cannot be done with a list, and
a COW tree would be too expensive, and would still require some
allocations and possibly rebalancing, etc.
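Concretely, removing one row currently amounts to something like
this hypothetical helper: one fresh GC allocation plus a near-full
copy on every recursive call:

// remove the element at index i by allocating a new array
int[] without(int[] availableRows, size_t i)
{
    return availableRows[0 .. i] ~ availableRows[i + 1 .. $];
}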
I found that this is indeed the problem, because if I allocate
int[64] x = void; on the stack and then use a ScopeBuffer!int from
std.internal.scopebuffer, backed by x, as newAvailableRows (which
is safe, because I wait for the threads to finish before I exit the
scope, and the threads only read from these arrays before making
their own copies), I am able to run my nqueens implementation at
full system utilization (using 36 hyperthreads at 100% each, for a
speed-up of about 21x), and it is able to solve N=18 in 7 seconds
(compared to about 2.5 minutes), with the parallel part going down
to level 8 of the recursion (to improve load balancing).
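For reference, here is a simplified sketch of the fixed serial
part, not my exact code: attacks() is the same hypothetical
conflict check as in the earlier sketch, and the int[64] store
assumes N <= 64.

import std.internal.scopebuffer;

// serial recursion with a stack-backed buffer for newAvailableRows
ulong solveSerialNoAlloc(in int[] partialSolution, in int[] availableRows)
{
    if (availableRows.length == 0)
        return 1;

    int[64] store = void;          // stack storage, reused for each branch
    ulong count = 0;
    foreach (i, row; availableRows)
    {
        if (attacks(partialSolution, row))
            continue;

        auto rest = ScopeBuffer!int(store);
        scope(exit) rest.free();   // freed before leaving the scope, so nothing escapes
        rest.put(availableRows[0 .. i]);
        rest.put(availableRows[i + 1 .. $]);
        // the callee only reads rest[] before making its own copies
        count += solveSerialNoAlloc(partialSolution ~ row, rest[]);
    }
    return count;
}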
So, the question is: why is the D / DMD allocator so slow under
heavy multithreading? The working set is pretty small (a few
megabytes at most), so I do not think this is an issue with GC
scanning itself. Can I plug in tcmalloc / jemalloc to be used as
the underlying allocator instead of glibc? Or does the D runtime
use mmap/sbrk/etc. directly?
Thanks.