std.parallelism: Request for Review

Sun Feb 27 08:36:02 PST 2011

On 2/27/2011 9:48 AM, dsimcha wrote:
> On 2/27/2011 8:03 AM, Russel Winder wrote:
>> 32-bit mode on a 8-core (twin Xeon) Linux box. That core.cpuid bug
>> really, really sucks.
>>
>> I see matrix inversion takes longer with 4 cores than with 1!
>

Actually, I am able to reproduce this, but only on Linux, and I think I
figured out why.  I think it's related to my Posix workaround for Bug
3753 (http://d.puremagic.com/issues/show_bug.cgi?id=3753).  This
workaround causes a GC heap allocation on every call to parallel(),
and the matrix inversion routine calls parallel() in a loop (256 times
over the course of the benchmark).  It was intended as a very quick and
dirty workaround for a DMD bug that I thought would get fixed a long
time ago.  It also seemed good enough at the time because I was using
this lib for very coarse-grained parallelism, where the effect is
negligible.

Originally, I was using alloca() all over the place to efficiently deal 
with memory management.  However, under Posix, I ran into Bug 3753 a 
long time ago and put in the following workaround, which simply forwards 
alloca() calls to the GC.  From near the top of parallelism.d:

// Workaround for bug 3753.
version(Posix) {
    // Can't use alloca() because it can't be used with exception
    // handling.
    // Use the GC instead even though it's slightly less efficient.
    void* alloca(size_t nBytes) {
        return GC.malloc(nBytes);
    }
} else {
    // Can really use alloca().
    import core.stdc.stdlib : alloca;
}

In this particular use case, where parallel() is called repeatedly in a
loop, the performance hit is probably substantial.  There are ways to
mitigate it (maybe having TaskPool maintain a free list, etc.), but I
can't bring myself to put a lot of effort into optimizing a workaround
for a compiler bug.
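
For readers unfamiliar with the free list idea mentioned above, here is a
minimal sketch of what it could look like.  It is not code from
std.parallelism: ScratchFreeList, acquire(), release(), gcAllocations and
countGcAllocations() are hypothetical names made up for illustration, all
buffers are assumed to have one fixed size, and the sketch is not
thread-safe (a real TaskPool would need locking or per-thread lists).

```d
import core.memory : GC;

// Hypothetical helper, NOT part of std.parallelism.  It caches released
// fixed-size scratch buffers so later requests can reuse them instead of
// going back to GC.malloc every time.
struct ScratchFreeList
{
    private static struct Node { Node* next; }

    private Node* head;        // singly linked list of released buffers
    private size_t blockSize;  // size in bytes of every buffer handed out
    size_t gcAllocations;      // number of real GC.malloc calls made

    this(size_t blockSize)
    {
        // A released buffer stores the list link in its first bytes,
        // so it must be at least as large as a Node.
        this.blockSize = blockSize < Node.sizeof ? Node.sizeof : blockSize;
    }

    // Return a cached buffer if one is available, else allocate a new one.
    void* acquire()
    {
        if (head !is null)
        {
            auto node = head;
            head = node.next;
            return cast(void*) node;
        }
        ++gcAllocations;
        return GC.malloc(blockSize);
    }

    // Hand a buffer back so the next acquire() can reuse it.
    void release(void* p)
    {
        auto node = cast(Node*) p;
        node.next = head;
        head = node;
    }
}

// Simulates `calls` sequential requests for scratch space and reports
// how many of them actually reached the GC.
size_t countGcAllocations(size_t calls)
{
    auto pool = ScratchFreeList(64);
    foreach (i; 0 .. calls)
    {
        void* buf = pool.acquire();
        // ... use buf as scratch space here ...
        pool.release(buf);
    }
    return pool.gcAllocations;
}
```

Under these assumptions, 256 sequential acquire/release pairs (the call
count from the benchmark above) reach the GC only once, on the first
request.  Whether this would recover the lost performance in the real
library is not something this sketch can show.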