Create many objects using threads

Kapps via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Tue May 6 08:56:09 PDT 2014


On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote:
> On 05/05/2014 02:38 PM, Kapps wrote:
>
> > I think that the GC actually blocks when creating objects, and
> > thus multiple threads creating instances would not provide a
> > significant speedup, possibly even a slowdown.
>
> Wow! That is the case. :)
>
> > You'd want to benchmark this to be certain it helps.
>
> I did:
>
> import std.range;
> import std.parallelism;
>
> class C
> {}
>
> void foo()
> {
>     auto c = new C;
> }
>
> void main(string[] args)
> {
>     enum totalElements = 10_000_000;
>
>     if (args.length > 1) {
>         foreach (i; iota(totalElements).parallel) {
>             foo();
>         }
>
>     } else {
>         foreach (i; iota(totalElements)) {
>             foo();
>         }
>     }
> }
>
> Typical run on my system for "-O -noboundscheck -inline":
>
> $ time ./deneme parallel
>
> real	0m4.236s
> user	0m4.325s
> sys	0m9.795s
>
> $ time ./deneme
>
> real	0m0.753s
> user	0m0.748s
> sys	0m0.003s
>
> Ali

Huh, that's a much, much higher impact than I'd expected.
I tried with GDC as well (the one in Debian stable, which is 
unfortunately still 2.055...) and got similar results. I also 
tried creating only totalCPUs threads and having each of them 
create totalElements / totalCPUs objects, rather than risking 
each creation becoming its own task (see the "parallel" branch 
in the code below), and the result still seems to be the same.

Using malloc and emplace instead of new, results are about 50% 
faster for single-threaded and ~3-4 times faster for 
multi-threaded (4-core / 8-thread machine, Linux 64-bit). The 
multi-threaded version is still twice as slow as the 
single-threaded one, though. On my Windows laptop (with the 
program compiled for 32-bit), it did not make a significant 
difference and the multi-threaded version is still 4 times slower.

That being said, I think most malloc implementations, while 
thread-safe, rely on locks or otherwise do not scale well across 
threads.
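
For illustration only (this isn't part of my benchmark, and 
createFromChunk is just a placeholder name), here's a minimal 
sketch of one way to reduce that contention: have each worker 
thread malloc one large chunk up front and bump-allocate its C 
instances from it with emplace, so the allocator is hit once per 
thread rather than once per object:

import std.conv : emplace;
import std.parallelism : totalCPUs, task, taskPool;
import core.stdc.stdlib : malloc;

class C {}

void createFromChunk(size_t count) {
     enum size = __traits(classInstanceSize, C);
     // One malloc call (one potential lock acquisition) per thread.
     auto chunk = cast(ubyte*) malloc(size * count);
     foreach (i; 0 .. count) {
         void[] mem = chunk[i * size .. (i + 1) * size];
         emplace!C(mem);
     }
     // The chunk is deliberately never freed, like in the benchmark
     // below; only creation cost is of interest.
}

void main() {
     foreach (i; 0 .. totalCPUs) {
         taskPool.put(task(&createFromChunk, 10_000_000 / totalCPUs));
     }
     taskPool.finish(true);
}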

Code:
import std.range;
import std.parallelism;
import std.datetime;
import std.stdio;
import core.stdc.stdlib;
import std.conv;

class C {}

void foo() {
     //auto c = new C;    // the GC-based version being compared against
     enum size = __traits(classInstanceSize, C);
     void[] mem = malloc(size)[0..size];
     emplace!C(mem);      // never freed; only creation is measured
}

void createFoos(size_t count) {
     foreach(i; 0 .. count) {
         foo();
     }
}

void main(string[] args) {
     StopWatch sw = StopWatch(AutoStart.yes);
     enum totalElements = 10_000_000;
     if (args.length <= 1) {
         foreach (i; iota(totalElements)) {
             foo();
         }
     } else if(args[1] == "tasks") {
         foreach (i; parallel(iota(totalElements))) {
             foo();
         }
     } else if(args[1] == "parallel") {
         for(int i = 0; i < totalCPUs; i++) {
         taskPool.put(task(&createFoos, totalElements / totalCPUs));
         }
         taskPool.finish(true);
     } else
         writeln("Unknown argument '", args[1], "'.");
     sw.stop();
     writeln(cast(Duration)sw.peek);
}

Results (Linux 64-bit):
shardsoft:~$ dmd -O -inline -release test.d
shardsoft:~$ ./test
552 ms, 729 μs, and 7 hnsecs
shardsoft:~$ ./test
532 ms, 139 μs, and 5 hnsecs
shardsoft:~$ ./test tasks
1 sec, 171 ms, 126 μs, and 4 hnsecs
shardsoft:~$ ./test tasks
1 sec, 38 ms, 468 μs, and 6 hnsecs
shardsoft:~$ ./test parallel
1 sec, 146 ms, 738 μs, and 2 hnsecs
shardsoft:~$ ./test parallel
1 sec, 268 ms, 195 μs, and 3 hnsecs
