std.parallelism curious results

Sun Oct 5 12:49:45 PDT 2014

Two problems. First, you should create your threads outside the 
stopwatch; timing thread creation is generally not a fair comparison 
in the real world, and it throws off the results for short tasks.
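
A minimal sketch of keeping pool start-up out of the timed region, assuming a current Phobos where StopWatch lives in std.datetime.stopwatch. The helper name timedChunkedSum and its parameters are made up for illustration, and it assumes the chunk count divides n evenly.

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.parallelism : taskPool;
import std.range : iota;
import core.atomic : atomicOp, atomicLoad;
import std.stdio : writefln;

// Hypothetical helper: sums 0 .. n-1 in `chunks` parallel work items and
// reports the elapsed time of the summing loop only, in microseconds.
ulong timedChunkedSum(ulong n, uint chunks, out long usecs)
{
    // taskPool is lazily initialized; touching it here starts the worker
    // threads before the stopwatch does.
    auto pool = taskPool;

    shared ulong total = 0;
    immutable ulong step = n / chunks;   // assumption: chunks divides n

    auto sw = StopWatch(AutoStart.yes);
    foreach (start; pool.parallel(iota(0UL, n, step), 1))
    {
        ulong local = 0;                 // thread-private accumulator
        foreach (k; start .. start + step)
            local += k;
        atomicOp!"+="(total, local);     // one atomic add per chunk
    }
    sw.stop();

    usecs = sw.peek.total!"usecs";
    return atomicLoad(total);
}

void main()
{
    long us;
    auto s = timedChunkedSum(400_000, 4, us);
    writefln("sum %s in %s us", s, us);
}
```

The first parallel loop in a program can still be slower than later ones (caches, page faults), so repeating the measurement a few times is usually worthwhile.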

Second, you are creating one thread per integer, which is bad. Do 
you really want to create 1B threads when you probably have only 
4 cores?
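
One way to avoid per-integer work items is to let the task pool split the range itself. The sketch below uses taskPool.reduce from std.parallelism; the helper name pooledSum is hypothetical, and this is an alternative to hand-written chunking rather than what the original code did.

```d
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writefln;

// Hypothetical helper: parallel sum of 0 .. n-1 on the global task pool.
// The pool has a fixed number of worker threads; the range is divided
// into work units, and the per-unit results are combined at the end.
ulong pooledSum(ulong n)
{
    // The seed 0UL is the identity for "+", which reduce requires.
    return taskPool.reduce!"a + b"(0UL, iota(0UL, n));
}

void main()
{
    writefln("pool workers: %s", taskPool.size);
    writefln("sum: %s", pooledSum(400_000));
}
```

No atomics are needed here, because each work unit accumulates privately before the partial results are merged.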

Below, 4 threads are used. Each thread adds up 1/4 of the 
integers, so with 1B integers it would be like 4 threads each 
adding up 250M integers (the code uses only 100,000 integers per 
thread so that it runs quickly). The elapsed time, compared to a 
single thread adding up the whole range, shows how much the 
parallelism costs.

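import std.stdio, std.parallelism, std.range, core.atomic;
// StopWatch now lives in std.datetime.stopwatch (it was in std.datetime in 2014).
import std.datetime.stopwatch : StopWatch;

void main()
{
	StopWatch sw;
	shared ulong sum1 = 0;	// written by several threads, so shared + atomic
	ulong sum2 = 0;		// only touched by the main thread
	long time1, time2;

	auto numThreads = 4;
	ulong iter = numThreads*100000UL;

	// One work item per thread; i is the first integer of each quarter.
	// Set up before the stopwatch starts so pool start-up is not timed.
	auto thds = parallel(iota(0, iter, iter/numThreads));

	sw.start();
	foreach(i; thds)
	{
		ulong s = 0;	// thread-local partial sum
		for(ulong k = 0; k < iter/numThreads; k++) { s += k; }
		s += i*iter/numThreads;	// shift the sum of 0..n-1 up to i..i+n-1
		atomicOp!"+="(sum1, s);	// one atomic add per thread
	}
	sw.stop(); time1 = sw.peek().total!"usecs";

	sw.reset(); sw.start();
	for (ulong i = 0; i < iter; ++i) { sum2 += i; }
	sw.stop(); time2 = sw.peek().total!"usecs";

	writefln("parallel sum : %s, elapsed %s us", atomicLoad(sum1), time1);
	writefln("single thread sum : %s, elapsed %s us", sum2, time2);
	writefln("Efficiency : %s%%", 100*time2/time1);
}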
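
As a sanity check on the offset term added after the inner loop, here is the arithmetic worked out, writing n for iter/numThreads and i = jn for the start of chunk j:

```latex
\sum_{k=0}^{n-1} (i + k) = \frac{n(n-1)}{2} + n\,i
\qquad\Longrightarrow\qquad
\sum_{j=0}^{3} \left[ \frac{n(n-1)}{2} + j\,n^{2} \right]
  = 2n(n-1) + 6n^{2} = 8n^{2} - 2n = \frac{4n(4n-1)}{2}
```

With n = 100000 this gives 79999800000, the same as summing 0 .. 399999 directly.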

http://dpaste.dzfl.pl/bfda7bb2e2b7

Some results:

parallel sum : 79999800000, elapsed 3356 us
single thread sum : 79999800000, elapsed 1984 us
Efficiency : 59%

(Not sure all the code is correct. The point is that you were 
creating 1B threads with 1B atomic operations, the worst possible 
comparison one can make between single- and multi-threaded tests.)