Low hanging fruit for optimizing loops

Fri Jun 7 23:11:04 PDT 2013

On Saturday, 8 June 2013 at 05:11:11 UTC, Walter Bright wrote:
> On 6/7/2013 9:15 PM, Juan Manuel Cabo wrote:
>> Given the recent surge in interest for performance, I dusted
>> off a small test that I made long ago and determined myself
>> to find the cause of the performance difference.
>
> It's great that you're doing this. You can track it down 
> further by using inline assembler and trying different 
> instruction sequences.
>
> Also, obj2asm gives nicer disassembly :-)

Thanks!!

I've now used inline assembler, and can confidently say
that the difference is due to alignment.
Changing the order of the cmp relative to the
increment made no difference.

Adding the right number of 'nop's makes it run in

       957 ms, 921 μs, and 4 hnsecs

But if I overshoot it, or fall one short, it goes back to

       1 sec, 438 ms, and 544 μs
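
As a rough illustration of the "right amount" of padding: if the loop
label has to land on a power-of-two boundary, the number of one-byte
nops follows from the label's byte offset. This is only a sketch. The
16-byte boundary, the offsets and the name paddingFor are all
hypothetical, since the post doesn't say which boundary matters.

```d
import std.stdio;

// Bytes of padding needed so that `offset` lands on the next multiple
// of `alignment`. `alignment` must be a power of two.
size_t paddingFor(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (alignment - (offset & (alignment - 1))) & (alignment - 1);
}

void main()
{
    // Hypothetical code offsets, not taken from the real binary.
    size_t[] offsets = [0x20, 0x1b, 0x2f];
    foreach (offset; offsets)
        writefln("offset 0x%x needs %s byte(s) of padding",
                 offset, paddingFor(offset, 16));
}
```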

Also, I couldn't use this instruction in D's asm{}:

         0f 1f 40 00       nop    DWORD PTR [rax+0x0]

and obj2asm doesn't disassemble it (it just puts "0f1f"
and gives incorrect asm for the next few instructions).
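
Two features of D's inline assembler might help here, going by the
language specification: the db pseudo-op emits raw bytes, so the
4-byte nop can be written as its encoding, and align N pads with nops
up to an N-byte boundary. Neither was tried in this post. The function
name nopThenReturn is made up for the example, and whether align 16
actually reproduces the fast timing is an assumption.

```d
import std.stdio;

// Hypothetical demo function: runs a hand-encoded multi-byte nop and
// an aligned nop, then returns its argument unchanged.
int nopThenReturn(int x)
{
    version (D_InlineAsm_X86_64)
    {
        asm
        {
            db 0x0F, 0x1F, 0x40, 0x00; // nop DWORD PTR [rax+0x0], as raw bytes
            align 16;                  // pad with nops to a 16-byte boundary
            nop;                       // this instruction starts on that boundary
        }
    }
    return x;
}

void main()
{
    writeln(nopThenReturn(42));
}
```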

I'm now not entirely sure that aligning loop jumps would be
worthwhile, though. They would have to be "leaf" loops,
because any call made inside the loop would overshadow
the benefit (I was looping millions of times in my test).

Anyway, here is the new source:

     import std.stdio;
     import std.datetime;

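     int fiba(int n) {
         asm {
             naked;
             push   RBP;
             mov    RBP,RSP;
             mov    RCX,RDI;          // n (first integer argument)
             mov    ESI,0x1;          // previous term
             mov    EAX,0x1;          // current term, also the result
             mov    EDX,0x2;          // loop counter
             cmp    ECX,0x2;
             jl     EXIT_LOOP;        // n < 2: return 1
             // 13 one-byte nops: padding that moves LOOP_START
             nop;
             nop; nop; nop; nop;
             nop; nop; nop; nop;
             nop; nop; nop; nop;
         LOOP_START:
             lea    EDI,[RSI+RAX*1];  // next = previous + current
             mov    RSI,RAX;          // previous = current
             mov    RAX,RDI;          // current = next
             inc    EDX;
             cmp    EDX,ECX;
             jle    LOOP_START;       // repeat while counter <= n
         EXIT_LOOP:
             pop    RBP;
             ret;                     // result is in EAX
         }
     }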

     void main() {
         auto start = Clock.currTime();
         int r = fiba(1_000_000_000);
         auto elapsed = Clock.currTime() - start;
         writeln(r);
         writeln(elapsed);
     }
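
For reference, the asm above reads back into plain D roughly as
follows. This is a sketch reconstructed from the instructions, not the
original D source of the test, and fibRef is a made-up name. The
additions wrap on int overflow, which is defined behaviour in D.

```d
import std.stdio;

// Same loop as the asm: both terms start at 1, the counter starts at 2
// and the body runs while counter <= n.
int fibRef(int n)
{
    int prev = 1;
    int cur = 1;
    if (n < 2)
        return cur;
    int i = 2;
    do
    {
        immutable next = prev + cur; // wraps on overflow
        prev = cur;
        cur = next;
        ++i;
    } while (i <= n);
    return cur;
}

void main()
{
    writeln(fibRef(10));
}
```

Comparing its result with fiba for the same n is a cheap check that
a reordered or padded asm variant still computes the same value.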

--jm