Create many objects using threads

Tue May 6 09:59:02 PDT 2014

On Tuesday, 6 May 2014 at 15:56:11 UTC, Kapps wrote:
> On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote:
>> On 05/05/2014 02:38 PM, Kapps wrote:
>>
>> > I think that the GC actually blocks when
>> > creating objects, and thus multiple threads creating
>> > instances would not provide a significant speedup,
>> > possibly even a slowdown.
>>
>> Wow! That is the case. :)
>>
>> > You'd want to benchmark this to be certain it helps.
>>
>> I did:
>>
>> import std.range;
>> import std.parallelism;
>>
>> class C
>> {}
>>
>> void foo()
>> {
>>    auto c = new C;
>> }
>>
>> void main(string[] args)
>> {
>>    enum totalElements = 10_000_000;
>>
>>    if (args.length > 1) {
>>        foreach (i; iota(totalElements).parallel) {
>>            foo();
>>        }
>>
>>    } else {
>>        foreach (i; iota(totalElements)) {
>>            foo();
>>        }
>>    }
>> }
>>
>> Typical run on my system for "-O -noboundscheck -inline":
>>
>> $ time ./deneme parallel
>>
>> real	0m4.236s
>> user	0m4.325s
>> sys	0m9.795s
>>
>> $ time ./deneme
>>
>> real	0m0.753s
>> user	0m0.748s
>> sys	0m0.003s
>>
>> Ali
>
> Huh, that's a much, much, higher impact than I'd expected.
> I tried with GDC as well (the one in Debian stable, which is 
> unfortunately still 2.055...) and got similar results. I also 
> tried creating only totalCPUs threads and having each of them 
> create NUM_ELEMENTS / totalCPUs objects rather than risking 
> that each creation was a task, and it still seems to be the 
> same.
>
>snip

I tried an allocator that never releases memory, rounds each 
allocation size up to a power of 2, and is lock-free. The results 
are quite a bit better:

shardsoft:~$ ./test
1 sec, 47 ms, 474 μs, and 4 hnsecs
shardsoft:~$ ./test
1 sec, 43 ms, 588 μs, and 2 hnsecs
shardsoft:~$ ./test tasks
692 ms, 769 μs, and 8 hnsecs
shardsoft:~$ ./test tasks
692 ms, 686 μs, and 8 hnsecs
shardsoft:~$ ./test parallel
691 ms, 856 μs, and 9 hnsecs
shardsoft:~$ ./test parallel
690 ms, 22 μs, and 3 hnsecs

I get similar results on my laptop (much faster than the results 
I got on it using DMD's malloc):
>test
1 sec, 125 ms, and 847 μs
>test
1 sec, 125 ms, 741 μs, and 6 hnsecs

>test tasks
556 ms, 613 μs, and 8 hnsecs
>test tasks
552 ms and 287 μs

>test parallel
554 ms, 542 μs, and 6 hnsecs
>test parallel
551 ms, 514 μs, and 9 hnsecs

Code:
http://pastie.org/9146326

Unfortunately the code doesn't compile with the ancient version of 
gdc available in Debian, so I couldn't test with that. The results 
should be quite a bit better there since core.atomic would be 
faster. And frankly, I'm not sure the allocator actually works 
properly, but it's just for testing purposes anyway.
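
In case the pastie link goes away, here is a minimal sketch of the 
kind of test described above. It is not the linked code: 
bumpAlloc, roundUpPow2, create, createMany, arenaSize and the 
16-byte minimum block are names and values assumed for 
illustration, and it needs a compiler recent enough to ship 
std.datetime.stopwatch. The allocator is a single fixed arena whose 
offset is advanced with an atomic add, so no lock is taken and 
nothing is ever freed.

```d
import core.atomic : atomicOp;
import core.stdc.stdlib : malloc;
import std.conv : emplace;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.parallelism : parallel, totalCPUs;
import std.range : iota;
import std.stdio : writeln;

class C {}

enum totalElements = 10_000_000;
enum size_t arenaSize = 256 * 1024 * 1024; // assumed capacity, never grown

__gshared ubyte* arenaBase; // set once in main before any thread allocates
shared size_t arenaOffset;  // next free byte, only touched atomically

// Round n up to a power of 2, with an assumed minimum of 16 bytes.
size_t roundUpPow2(size_t n)
{
    size_t p = 16;
    while (p < n)
        p <<= 1;
    return p;
}

// Lock-free bump allocation: reserve a block with one atomic add.
// Returns null once the arena is exhausted; memory is never released.
void* bumpAlloc(size_t n)
{
    immutable size = roundUpPow2(n);
    // atomicOp returns the updated offset, so our block starts `size` earlier.
    immutable end = atomicOp!"+="(arenaOffset, size);
    if (end > arenaSize)
        return null;
    return arenaBase + (end - size);
}

// Construct a class instance inside arena memory instead of the GC heap.
T create(T)() if (is(T == class))
{
    enum size = __traits(classInstanceSize, T);
    void* p = bumpAlloc(size);
    assert(p !is null, "arena exhausted");
    return emplace!T(p[0 .. size]);
}

void createMany(size_t count)
{
    foreach (i; 0 .. count)
    {
        auto c = create!C();
    }
}

void main(string[] args)
{
    arenaBase = cast(ubyte*) malloc(arenaSize);
    assert(arenaBase !is null, "could not reserve the arena");

    immutable mode = args.length > 1 ? args[1] : "serial";
    auto sw = StopWatch(AutoStart.yes);

    if (mode == "parallel")
    {
        // One loop iteration per object; std.parallelism picks the batching.
        foreach (i; iota(totalElements).parallel)
            createMany(1);
    }
    else if (mode == "tasks")
    {
        // One chunk per core (any remainder of the division is skipped).
        immutable perTask = totalElements / totalCPUs;
        foreach (t; iota(totalCPUs).parallel(1))
            createMany(perTask);
    }
    else
    {
        createMany(totalElements);
    }

    writeln(sw.peek); // prints a Duration such as "1 sec, 47 ms, ..."
}
```

Run it as ./test, ./test tasks or ./test parallel. The arena size 
assumes the class instances are small; a bigger class or a larger 
totalElements would need a larger arenaSize.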