Create many objects using threads
Kapps via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Tue May 6 09:59:02 PDT 2014
On Tuesday, 6 May 2014 at 15:56:11 UTC, Kapps wrote:
> On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote:
>> On 05/05/2014 02:38 PM, Kapps wrote:
>>
>> > I think that the GC actually blocks when
>> > creating objects, and thus multiple threads creating
>> instances would not
>> > provide a significant speedup, possibly even a slowdown.
>>
>> Wow! That is the case. :)
>>
>> > You'd want to benchmark this to be certain it helps.
>>
>> I did:
>>
>> import std.range;
>> import std.parallelism;
>>
>> class C
>> {}
>>
>> void foo()
>> {
>> auto c = new C;
>> }
>>
>> void main(string[] args)
>> {
>> enum totalElements = 10_000_000;
>>
>> if (args.length > 1) {
>> foreach (i; iota(totalElements).parallel) {
>> foo();
>> }
>>
>> } else {
>> foreach (i; iota(totalElements)) {
>> foo();
>> }
>> }
>> }
>>
>> Typical run on my system for "-O -noboundscheck -inline":
>>
>> $ time ./deneme parallel
>>
>> real 0m4.236s
>> user 0m4.325s
>> sys 0m9.795s
>>
>> $ time ./deneme
>>
>> real 0m0.753s
>> user 0m0.748s
>> sys 0m0.003s
>>
>> Ali
>
> Huh, that's a much, much, higher impact than I'd expected.
> I tried with GDC as well (the one in Debian stable, which is
> unfortunately still 2.055...) and got similar results. I also
> tried creating only totalCPUs threads and having each of them
> create NUM_ELEMENTS / totalCPUs objects rather than risking
> that each creation was a task, and it still seems to be the
> same.
>
>snip
I tried with using an allocator that never releases memory,
rounds up to a power of 2, and is lock-free. The results are
quite a bit better.
shardsoft:~$ ./test
1 sec, 47 ms, 474 μs, and 4 hnsecs
shardsoft:~$ ./test
1 sec, 43 ms, 588 μs, and 2 hnsecs
shardsoft:~$ ./test tasks
692 ms, 769 μs, and 8 hnsecs
shardsoft:~$ ./test tasks
692 ms, 686 μs, and 8 hnsecs
shardsoft:~$ ./test parallel
691 ms, 856 μs, and 9 hnsecs
shardsoft:~$ ./test parallel
690 ms, 22 μs, and 3 hnsecs
I get similar results on my laptop (which is much faster than the
results I got on it using DMD's malloc):
>test
1 sec, 125 ms, and 847 ╬╝s
>test
1 sec, 125 ms, 741 ╬╝s, and 6 hnsecs
>test tasks
556 ms, 613 ╬╝s, and 8 hnsecs
>test tasks
552 ms and 287 ╬╝s
>test parallel
554 ms, 542 ╬╝s, and 6 hnsecs
>test parallel
551 ms, 514 ╬╝s, and 9 hnsecs
Code:
http://pastie.org/9146326
Unfortunately it doesn't compile with the ancient version of gdc
available in Debian, so I couldn't test with that. The results
should be quite a bit better since core.atomic would be faster.
And frankly, I'm not sure if the allocator actually works
properly, but it's just for testing purposes anyways.
More information about the Digitalmars-d-learn
mailing list