Skynet 1M Fiber microbenchmark in D
ikod
geller.garry at gmail.com
Wed Oct 18 13:37:23 UTC 2017
On Wednesday, 18 October 2017 at 11:52:08 UTC, Biotronic wrote:
> On Wednesday, 18 October 2017 at 11:34:57 UTC, Nordlöw wrote:
>> Another thing...how should the synchronization between the
>> fibers figure out when the total number of fibers has reached
>> one million?...via an atomic counter fed by reference to the
>> constructor...or are there better ways? Because I do need an
>> atomic reference counter here, right?
>
> This is how I did it:
> import core.thread : Fiber;
>
> class MyFiber : Fiber {
>     int _depth;
>     ulong _index;
>     ulong _value;
>
>     this(int depth, ulong index) {
>         super(&run);
>         _depth = depth;
>         _index = index;
>     }
>
>     void run() {
>         if (_depth == 6) { // 10^6 == 1 million, so stop here.
>             _value = _index;
>             return;
>         }
>
>         _value = 0;
>         foreach (i; 0..10) { // Line 23
>             auto e = new MyFiber(_depth+1, _index * 10 + i);
>             e.call();
>             _value += e._value;
>         }
>     }
> }
>
> unittest {
>     import std.stdio : writeln;
>     import std.datetime.stopwatch : StopWatch, AutoStart;
>     auto sw = StopWatch(AutoStart.yes);
>     auto a = new MyFiber(0, 0);
>     a.call();
>     sw.stop();
>     assert(a._value == 499999500000);
>     writeln(a._value, " after ", sw.peek);
> }
>
>
>> And how do I parallelize this over multiple worker threads?
>> AFAICT fibers are by default all spawned in the same main
>> thread, right?
>
> True. Well, they're not really spawned on any thread - they're
> allocated on the heap, have their own stack, and are run on
> whichever thread happens to invoke their call() method.
>
> I experimented a little bit with parallelism, and the easiest
> approach is definitely to replace line 23 with this:
>
> foreach (i; taskPool.parallel(10.iota, 1)) {
>
> It seems to make very little difference in terms of run time,
> though. I tried using a mix of these approaches - parallel at
> low depth, basically just to fill up the cores, and serial
> closer to the leaves. The difference is still negligible, so I
> assume the losses are elsewhere.
>
> --
> Biotronic
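For reference, that replacement only compiles with a couple of extra imports (taskPool from std.parallelism and iota from std.range), and once the iterations can run on different worker threads, adding to the parent's _value directly would be a data race. Here is a minimal sketch of the loop with those imports and a per-index accumulator (the partial array is my addition, not part of Biotronic's code):

import std.parallelism : taskPool;
import std.range : iota;

ulong[10] partial;                        // one slot per child, so no race on _value
foreach (i; taskPool.parallel(10.iota, 1)) {
    auto e = new MyFiber(_depth + 1, _index * 10 + i);
    e.call();                             // runs on whichever worker thread picks up i
    partial[i] = e._value;
}
_value = 0;
foreach (p; partial)
    _value += p;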
I ran this under Linux perf, and here is the top of 'perf report':
# Overhead  Command  Shared Object       Symbol
# ........  .......  ..................  ..................................................
#
    7.34%   t        [kernel.kallsyms]   [k] clear_page
    6.80%   t        [kernel.kallsyms]   [k] __do_page_fault
    5.39%   t        [kernel.kallsyms]   [k] __lock_text_start
    3.90%   t        t                   [.] nothrow core.thread.Fiber core.thread.Fiber.__ctor(void delegate(), ulong)
    3.73%   t        [kernel.kallsyms]   [k] unmap_page_range
    3.32%   t        [kernel.kallsyms]   [k] flush_tlb_mm_range
    2.70%   t        [kernel.kallsyms]   [k] _raw_spin_lock
    2.57%   t        libpthread-2.23.so  [.] pthread_mutex_unlock
    2.53%   t        t                   [.] nothrow void core.thread.Fiber.__dtor()
So it looks like memory management, even though it is not the GC, takes most of the
time (if I interpret these numbers correctly).
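If the Fiber constructor and destructor really are where the time goes, one thing that might be worth trying (just a sketch, not the benchmark above) is recycling a fiber with Fiber.reset, so its stack is allocated once instead of a million times and the clear_page/__do_page_fault cost should mostly disappear:

import core.thread : Fiber;

void main()
{
    ulong value;
    ulong index;
    auto f = new Fiber({ value = index; });   // the per-leaf work from MyFiber.run

    ulong sum;
    foreach (ulong i; 0 .. 1_000_000)
    {
        index = i;
        f.call();        // run the fiber to completion
        sum += value;
        f.reset();       // reuse the same fiber (and its stack) for the next leaf
    }
    assert(sum == 499_999_500_000);
}

That trades away the tree structure of the original benchmark, of course; it only isolates the cost of creating fibers versus calling them.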