The GC and performance, but not what you expect

Adam Sakareassen via Digitalmars-d digitalmars-d at puremagic.com
Tue Jun 10 21:39:15 PDT 2014


On 11/06/2014 6:39 AM, "Nordlöw" via Digitalmars-d wrote:
> Have you benchmarked your lock-free GC against D's builtin with
> different allocation sizes? I've seen comments on the need for
> that...

Only a little.  In scripts where I deliberately introduce contention, my 
allocator is quicker; the results are much closer when there is little 
contention.  The little test script below uses 12 threads to do 100,000 
allocations each, with one run for each of the sizes 
(16, 32, 64, 512, 512, 64, 32, 16) bytes.  (Script attached below.)

With DMD's built-in GC
407 milliseconds (16 bytes)
409 milliseconds (32 bytes)
432 milliseconds (64 bytes)
638 milliseconds (512 bytes)
-collect-
633 milliseconds (512 bytes)
422 milliseconds (64 bytes)
418 milliseconds (32 bytes)
394 milliseconds (16 bytes)

My GC
22 milliseconds (16 bytes)
18 milliseconds (32 bytes)
27 milliseconds (64 bytes)
293 milliseconds (512 bytes)
-collect-
194 milliseconds (512 bytes)
133 milliseconds (64 bytes)
60 milliseconds (32 bytes)
62 milliseconds (16 bytes)

The timing of the collect itself is not included, because it would not be 
a fair benchmark: my GC does not minimise (return freed pages) yet, which 
means it could easily run out of memory sooner.  (A job for later.)

Some interesting observations are:

512-byte allocations don't come from thread-specific arenas, so there is 
more contention on the CAS operation.

The allocations after the collection come mostly from the free list, so 
they are slower than the initial fresh allocations (list maintenance and 
zeroing of the memory are required; this is the same for the built-in 
GC).  The only real difference in the speeds after the collect is lock 
contention.

I have a 6-core CPU (hyper-threaded to 12 logical cores), so contention 
levels may be different on machines with fewer cores.  I'm not sure how 
this will change things: there might be less contention, but blocking 
threads may slow things down on the built-in GC.  Tests are on Windows 
using 32-bit code.

Cheers
Adam


----

import std.stdio;
import std.datetime;
import std.concurrency;
import core.memory;
import core.thread;

// Padding sizes are chosen so that each `new T` (int + pointer + padding,
// 32-bit) lands in the 16-, 32-, 64- and 512-byte allocation size classes.
struct LinkedList16{
	int value = 0;
	LinkedList16* next;
}

struct LinkedList32{
	int value = 0;
	LinkedList32* next;
	byte[16] padding;
}

struct LinkedList64{
	int value = 0;
	LinkedList64* next;
	byte[32] padding;
}

struct LinkedList512{
	int value = 0;
	LinkedList512* next;
	byte[256] padding;
}

shared int threadCount = 0;	// number of worker threads still running

void main(){
	core.memory.GC.disable();	// no automatic collections during the timed runs
	doTest!LinkedList16;
	doTest!LinkedList32;
	doTest!LinkedList64;
	doTest!LinkedList512;
	GC.collect();	// explicit collect between the two passes (not timed)
	doTest!LinkedList512;
	doTest!LinkedList64;
	doTest!LinkedList32;
	doTest!LinkedList16;
}

void doTest(T)(){
	import core.atomic : atomicOp, atomicLoad;

	auto start = Clock.currSystemTick();
	foreach(i; 0 .. 12){
		atomicOp!"+="(threadCount, 1);	// count the worker before it can finish
		spawn(&doSomething!T, thisTid);
	}
	// Spin until every worker has signalled completion.
	while(atomicLoad(threadCount) > 0){}

	auto ln = Clock.currSystemTick() - start;
	writeln(ln.msecs, " milliseconds");
}

void doSomething(T)(Tid tid){
	import core.atomic : atomicOp;

	auto top = new T;
	auto recent = top;

	// Build a linked list of 100,000 freshly allocated nodes.
	foreach(i; 1 .. 100_000){
		auto newList = new T;
		newList.value = i % 128;

		recent.next = newList;
		recent = newList;
	}

	// Sum the values.  (Just spends some time walking the memory.)
	recent = top;
	long total = 0;
	while(recent !is null){
		total += recent.value;
		recent = recent.next;
	}
	//writeln("Total : ", total);
	atomicOp!"-="(threadCount, 1);	// signal completion to doTest
}
