Memory Allocation and Locking

Fri Aug 22 12:50:38 PDT 2008

dsimcha wrote:
> I guess this really is more of a C++ question than a D question, but it's kind
> of both, so what the heck?
> 
> Anyhow, I had some old C++ code lying around that I've been using, and I had
> added some quick and dirty multithreading of embarrassingly parallel parts of
> it to speed it up.  Lo and behold, it actually worked.  A while later, I
> decided I needed to extend it in ways that couldn't be done nearly as easily
> in C++ as in D, and it wasn't much code, so I just translated it to D.
> 
> First iteration was slow as dirt because of resource contention for memory
> allocation.  Put in a few ugly hacks to avoid all memory allocation in inner
> loops (i.e. ~= ), and now the D version is faster than the C++ version.
> 
> Anyhow, the real question is, why is it that I can get away w/ heap
> allocations in inner loops (i.e. vector.push_back() ) in multithreaded C++
> code but not in multithreaded D code?  I'm assuming, since there didn't seem
> to be any resource contention in the C++ version, that there was no locking
> for the memory allocation, but all of the allocation was done through STL
> rather than explicitly, so I don't really know.  However, the code actually
> worked like this, so I never really gave it much thought until today.  Is it
> just a minor miracle that this worked at all w/o me explicitly using some kind
> of lock?

You can get away with it in C++ because there are no garbage collection 
cycles occurring when the allocator runs low on memory.  The easiest way 
to approach an apples-to-apples comparison would be to disable the GC at 
the start of your program.

Another performance issue for some programs is a result of how the GC 
allocator obtains memory from the OS.  It does so in bite-sized blocks, 
so if your app does a ton of allocation, the GC is not only collecting 
periodically but is also obtaining a chunk of memory from the OS to 
store the latest allocation (with some extra room for additional 
blocks), then doing the same thing again, and so on.  The Tango GC 
provides a reserve() method to optimize for this situation by allowing 
the user to have the GC pre-build a pool of N bytes for use by future 
allocations.  In my testing this increased app performance by 50% or 
more given certain program designs.
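The cost of fetching memory in small chunks, compared with reserving a pool up front, can be seen with a toy allocator. What follows is a minimal C++ sketch under stated assumptions, not Tango's GC or its reserve() implementation. `ChunkedPool`, its 4096-byte chunk size, and `requestsFor` are hypothetical names chosen for illustration, and `malloc` stands in for asking the OS for memory. The sketch has no collector. It only counts how many times the pool has to go back to the system.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical toy allocator: it requests memory from the system (malloc
// stands in for the OS) one chunk at a time and bump-allocates within the
// current chunk. It never frees individual blocks and has no collector.
class ChunkedPool {
public:
    explicit ChunkedPool(std::size_t chunkSize) : chunkSize_(chunkSize) {}
    ~ChunkedPool() {
        for (void* c : chunks_) std::free(c);
    }
    ChunkedPool(const ChunkedPool&) = delete;
    ChunkedPool& operator=(const ChunkedPool&) = delete;

    // Pre-build one region of at least n bytes, so that up to n bytes of
    // later allocations need no further system requests (similar in spirit
    // to the reserve() idea, not a copy of any real GC's API).
    void reserve(std::size_t n) {
        if (remaining() < n) newChunk(n);
    }

    // Return n bytes (rounded up to max_align_t alignment), fetching a new
    // chunk only when the current one is exhausted.
    void* allocate(std::size_t n) {
        const std::size_t align = alignof(std::max_align_t);
        n = (n + align - 1) / align * align;
        if (remaining() < n) newChunk(n > chunkSize_ ? n : chunkSize_);
        void* p = cur_;
        cur_ += n;
        return p;
    }

    // Number of times the pool had to go back to the system for memory.
    std::size_t systemRequests() const { return chunks_.size(); }

private:
    std::size_t remaining() const {
        return static_cast<std::size_t>(end_ - cur_);
    }
    void newChunk(std::size_t n) {
        chunks_.reserve(chunks_.size() + 1);  // so push_back below cannot throw
        void* c = std::malloc(n);
        if (c == nullptr) throw std::bad_alloc();
        chunks_.push_back(c);
        cur_ = static_cast<char*>(c);
        end_ = cur_ + n;
    }

    std::size_t chunkSize_;
    std::vector<void*> chunks_;
    char* cur_ = nullptr;
    char* end_ = nullptr;
};

// Make `count` allocations of `size` bytes each, optionally reserving
// `preReserve` bytes first, and report how many chunks were requested.
std::size_t requestsFor(std::size_t count, std::size_t size,
                        std::size_t preReserve) {
    ChunkedPool pool(4096);
    if (preReserve > 0) pool.reserve(preReserve);
    for (std::size_t i = 0; i < count; ++i) pool.allocate(size);
    return pool.systemRequests();
}
```

Under these assumptions, requestsFor(10000, 16, 0) needs 40 chunk requests (each 4096-byte chunk holds 256 sixteen-byte blocks). requestsFor(10000, 16, 160000) needs only one. A real GC must also decide when to collect rather than grow, and this sketch leaves that out entirely.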

Sean