Skynet 1M Fiber microbenchmark in D

ikod geller.garry at gmail.com
Wed Oct 18 13:37:23 UTC 2017


On Wednesday, 18 October 2017 at 11:52:08 UTC, Biotronic wrote:
> On Wednesday, 18 October 2017 at 11:34:57 UTC, Nordlöw wrote:
>> Another thing...how should the synchronization between the 
>> fibers figure out when the total number of fibers has reached 
>> one million?...via an atomic counter fed by reference to the 
>> constructor...or are there better ways? Because I do need an 
>> atomic reference counter here, right?
>
> This is how I did it:
> import core.thread : Fiber;
>
> class MyFiber : Fiber {
>     int _depth;
>     ulong _index;
>     ulong _value;
>
>     this(int depth, ulong index) {
>         super(&run);
>         _depth = depth;
>         _index = index;
>     }
>
>     // Each fiber spawns ten children until depth 6; leaves store their
>     // index, and each parent sums its children's values.
>     void run() {
>         if (_depth == 6) { // 10^6 == 1 million, so stop here.
>             _value = _index;
>             return;
>         }
>
>         _value = 0;
>         foreach (i; 0..10) { // Line 23
>             auto e = new MyFiber(_depth+1, _index * 10 + i);
>             e.call();
>             _value += e._value;
>         }
>     }
> }
>
> unittest {
>     import std.stdio : writeln;
>     import std.datetime.stopwatch : StopWatch, AutoStart;
>     auto sw = StopWatch(AutoStart.yes);
>     auto a = new MyFiber(0, 0);
>     a.call();
>     sw.stop();
>     assert(a._value == 499999500000);
>     writeln(a._value, " after ", sw.peek);
> }
>
>
>> And how do I parallelize this over multiple worker threads? 
>> AFAICT fibers are by default all spawned in the same main 
>> thread, right?
>
> True. Well, they're not really spawned on any thread - they're 
> allocated on the heap, have their own stack, and are run on 
> whichever thread happens to invoke their call() method.
>
> I experimented a little bit with parallelism, and the easiest approach
> is definitely to replace line 23 with this:
>
> foreach (i; taskPool.parallel(10.iota, 1)) {
>
> It seems to make very little difference in terms of run time, 
> though. I tried using a mix of these approaches - parallel at 
> low depth, basically just to fill up the cores, and serial 
> closer to the leaves. The difference is still negligible, so I 
> assume the losses are elsewhere.
>
> --
>   Biotronic
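
To make the point about threads concrete: a Fiber is just a heap object
with its own stack, so it can be resumed from whichever thread calls its
call() method. A tiny illustrative sketch (the thread name and setup are
mine, not part of the benchmark):

import core.thread : Fiber, Thread;
import std.stdio : writeln;

void main() {
    // The fiber is a heap-allocated object with its own stack; nothing
    // binds it to the thread that constructed it.
    auto fib = new Fiber({
        writeln("fiber body running on thread: ", Thread.getThis.name);
    });

    auto t = new Thread({ fib.call(); }); // resume it from another thread
    t.name = "worker";
    t.start();
    t.join();
}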
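
For the parallel variant, the line-23 replacement needs taskPool from
std.parallelism and iota from std.range, and adding straight into _value
from several worker threads would be a data race. Below is a minimal
self-contained sketch of MyFiber with run() rewritten that way; it is my
own reconstruction under those assumptions, not necessarily the exact
code Biotronic ran:

import core.thread : Fiber;
import std.parallelism : taskPool;
import std.range : iota;

class MyFiber : Fiber {
    int _depth;
    ulong _index;
    ulong _value;

    this(int depth, ulong index) {
        super(&run);
        _depth = depth;
        _index = index;
    }

    void run() {
        if (_depth == 6) { // leaves: 10^6 == 1 million fibers in total
            _value = _index;
            return;
        }

        // Line 23 replaced: distribute the ten children over the task pool.
        // Each child gets its own slot, so the summation below (which runs
        // after parallel() has joined) does not race.
        MyFiber[10] children;
        foreach (i; taskPool.parallel(10.iota, 1)) {
            children[i] = new MyFiber(_depth + 1, _index * 10 + i);
            children[i].call(); // runs on whichever worker thread took i
        }

        _value = 0;
        foreach (c; children)
            _value += c._value;
    }
}

unittest {
    auto a = new MyFiber(0, 0);
    a.call();
    assert(a._value == 499_999_500_000);
}

Each child lands in its own array slot and taskPool.parallel joins before
the serial summation, so no extra synchronization (and no atomic counter)
is needed.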

I ran the code above under Linux perf; here is the top of 'perf report':

# Overhead  Command  Shared Object       Symbol
# ........  .......  ..................  .......................................
#
      7.34%  t        [kernel.kallsyms]   [k] clear_page
      6.80%  t        [kernel.kallsyms]   [k] __do_page_fault
      5.39%  t        [kernel.kallsyms]   [k] __lock_text_start
      3.90%  t        t                   [.] nothrow core.thread.Fiber core.thread.Fiber.__ctor(void delegate(), ulong)
      3.73%  t        [kernel.kallsyms]   [k] unmap_page_range
      3.32%  t        [kernel.kallsyms]   [k] flush_tlb_mm_range
      2.70%  t        [kernel.kallsyms]   [k] _raw_spin_lock
      2.57%  t        libpthread-2.23.so  [.] pthread_mutex_unlock
      2.53%  t        t                   [.] nothrow void core.thread.Fiber.__dtor()

So it looks like memory management, even though it is not the GC, takes
most of the time (if I interpret these numbers correctly): the page-fault,
clear_page and unmap entries are consistent with each fiber's stack being
mapped and unmapped by the kernel.

