Create many objects using threads
Kapps via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Tue May 6 08:56:09 PDT 2014
On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote:
> On 05/05/2014 02:38 PM, Kapps wrote:
>
> > I think that the GC actually blocks when
> > creating objects, and thus multiple threads creating
> instances would not
> > provide a significant speedup, possibly even a slowdown.
>
> Wow! That is the case. :)
>
> > You'd want to benchmark this to be certain it helps.
>
> I did:
>
> import std.range;
> import std.parallelism;
>
> class C
> {}
>
> void foo()
> {
> auto c = new C;
> }
>
> void main(string[] args)
> {
> enum totalElements = 10_000_000;
>
> if (args.length > 1) {
> foreach (i; iota(totalElements).parallel) {
> foo();
> }
>
> } else {
> foreach (i; iota(totalElements)) {
> foo();
> }
> }
> }
>
> Typical run on my system for "-O -noboundscheck -inline":
>
> $ time ./deneme parallel
>
> real 0m4.236s
> user 0m4.325s
> sys 0m9.795s
>
> $ time ./deneme
>
> real 0m0.753s
> user 0m0.748s
> sys 0m0.003s
>
> Ali
Huh, that's a much, much, higher impact than I'd expected.
I tried with GDC as well (the one in Debian stable, which is
unfortunately still 2.055...) and got similar results. I also
tried creating only totalCPUs threads and having each of them
create NUM_ELEMENTS / totalCPUs objects rather than risking that
each creation was a task, and it still seems to be the same.
Using malloc and emplace instead of new D, results are about 50%
faster for single-threadeded and ~3-4 times faster for
multi-threaded (4 cpu 8 thread machine, Linux 64-bit). The
multi-threaded version is still twice as slow though. On my
Windows laptop (with the program compiled for 32-bit), it did not
make a significant difference and the multi-threaded version is
still 4 times slower.
That being said, I think most malloc implementations while being
thread-safe, usually use locks or do not scale well.
Code:
import std.range;
import std.parallelism;
import std.datetime;
import std.stdio;
import core.stdc.stdlib;
import std.conv;
class C {}
void foo() {
//auto c = new C;
enum size = __traits(classInstanceSize, C);
void[] mem = malloc(size)[0..size];
emplace!C(mem);
}
void createFoos(size_t count) {
foreach(i; 0 .. count) {
foo();
}
}
void main(string[] args) {
StopWatch sw = StopWatch(AutoStart.yes);
enum totalElements = 10_000_000;
if (args.length <= 1) {
foreach (i; iota(totalElements)) {
foo();
}
} else if(args[1] == "tasks") {
foreach (i; parallel(iota(totalElements))) {
foo();
}
} else if(args[1] == "parallel") {
for(int i = 0; i < totalCPUs; i++) {
taskPool.put(task(&createFoos, totalElements /
totalCPUs));
}
taskPool.finish(true);
} else
writeln("Unknown argument '", args[1], "'.");
sw.stop();
writeln(cast(Duration)sw.peek);
}
Results (Linux 64-bit):
shardsoft:~$ dmd -O -inline -release test.d
shardsoft:~$ ./test
552 ms, 729 μs, and 7 hnsecs
shardsoft:~$ ./test
532 ms, 139 μs, and 5 hnsecs
shardsoft:~$ ./test tasks
1 sec, 171 ms, 126 μs, and 4 hnsecs
shardsoft:~$ ./test tasks
1 sec, 38 ms, 468 μs, and 6 hnsecs
shardsoft:~$ ./test parallel
1 sec, 146 ms, 738 μs, and 2 hnsecs
shardsoft:~$ ./test parallel
1 sec, 268 ms, 195 μs, and 3 hnsecs
More information about the Digitalmars-d-learn
mailing list