Windows multi-threading performance issues on multi-core systems only

Tue Dec 15 00:26:11 PST 2009

== Quote from dsimcha (dsimcha at yahoo.com)'s article
> == Quote from Dan (dsstruthers at yahoo.com)'s article
> > I have a question regarding a performance issue I am seeing on multicore Windows
> systems.  I am creating many threads to do parallel tasks, and on multicore
> Windows systems the performance is abysmal.  If I use task manager to set the
> processor affinity to a single CPU, the program runs as I would expect.  Without
> that, it takes about 10 times as long to complete.
> > Am I doing something wrong?  I have tried DMD 2.0.37 and DMD 1.0.53 with the
> same results, running the binary on both a dual-core P4 and a newer Core2 duo.
> > Any help is greatly appreciated!
> I've seen this happen before.  Without knowing the details of your code, my best
> guess is that you're getting a lot of contention for the GC lock.  (It could also
> be some other lock, but if it were, there's a good chance you'd already know it
> because it wouldn't be hidden.)  The current GC design isn't very
> multithreading-friendly yet.  It requires a lock on every allocation.
> Furthermore, the array append operator (~=) currently takes the GC lock on **every
> append** to query the GC for info about the memory block that the array points to.
>  There's been plenty of talk about what should be done to eliminate this, but
> nothing has been implemented so far.
> Assuming I am right about why your code is so slow, here's how to deal with it:
> 1.  Cut down on unnecessary memory allocations.  Use structs instead of classes
> where it makes sense.
> 2.  Try to stack allocate stuff.  alloca is your friend.
> 3.  Pre-allocate arrays if you know ahead of time how long they're supposed to be.
>  If you don't know how long they're supposed to be, use std.array.Appender (in D2)
> for now until a better solution gets implemented.  Never use ~= in multithreaded
> code that gets executed a lot.

Yes, I've seen this before, too. But in my multithreaded code the allocations
can't be avoided, so D's GC needs to improve its performance for multithreaded programs.
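
For reference, the third suggestion quoted above (pre-allocate when the length is
known, otherwise use std.array.Appender instead of ~=) might look roughly like the
sketch below. The function names and the data are made up for illustration, and the
code assumes a reasonably current D2 Phobos; the Appender interface available when
this thread was written may have differed.

```d
import std.array : appender;

// Hypothetical example: the length is known up front, so allocate once
// and fill by index instead of appending with ~= inside the loop.
int[] squaresPreallocated(size_t n)
{
    auto result = new int[n];          // single GC allocation
    foreach (i, ref slot; result)
        slot = cast(int)(i * i);       // plain write, no allocation
    return result;
}

// Hypothetical example: the length is not known up front, so collect
// the values through an Appender rather than with ~=.
int[] evensBelow(int limit)
{
    auto app = appender!(int[])();
    foreach (i; 0 .. limit)
        if (i % 2 == 0)
            app.put(i);                // Appender tracks capacity itself
    return app.data;
}

void main()
{
    assert(squaresPreallocated(4) == [0, 1, 4, 9]);
    assert(evensBelow(7) == [0, 2, 4, 6]);
}
```

This only reduces how often each thread has to go through the GC. It does not
remove the lock, so code that really must allocate heavily in every thread will
still contend.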