Low hanging fruit for optimizing loops
Juan Manuel Cabo
juanmanuel.cabo at gmail.com
Fri Jun 7 23:11:04 PDT 2013
On Saturday, 8 June 2013 at 05:11:11 UTC, Walter Bright wrote:
> On 6/7/2013 9:15 PM, Juan Manuel Cabo wrote:
>> Given the recent surge in interest for performance, I dusted
>> off a small test that I made long ago and determined myself
>> to find the cause of the performance difference.
>
> It's great that you're doing this. You can track it down
> further by using inline assembler and trying different
> instruction sequences.
>
> Also, obj2asm gives nicer disassembly :-)
Thanks!!
I now used inline assembler, and can confidently say
that the difference is because of the alignment.
Changing the order of the cmp relative to the
increment didn't do anything.
Adding the right amount of 'nop' makes it run in
957 ms, 921 μs, and 4 hnsecs
But if I overshoot it, or miss one, it goes back to
1 sec, 438 ms, and 544 μs
Also, I couldn't use this instruction in D's asm{}
0f 1f 40 00 nop DWORD PTR [rax+0x0]
and obj2asm doesn't dissasemble it (it just puts "0f1f"
and gives incorrent asm for the next few instructions).
I'm now not entirely sure that aligning loop jumps would be
worthwhile though. They would have to be "leaf" loops
because any call made inside the loop would overshadow
the benefits (I was looping millons of times in my test).
Anyway, here is the new source:
import std.stdio;
import std.datetime;
int fiba(int n) {
asm {
naked;
push RBP;
mov RBP,RSP;
mov RCX,RDI;
mov ESI,0x1;
mov EAX,0x1;
mov EDX,0x2;
cmp ECX,0x2;
jl EXIT_LOOP;
nop;
nop; nop; nop; nop;
nop; nop; nop; nop;
nop; nop; nop; nop;
LOOP_START:
lea EDI,[RSI+RAX*1];
mov RSI,RAX;
mov RAX,RDI;
inc EDX;
cmp EDX,ECX;
jle LOOP_START;
EXIT_LOOP:
pop RBP;
ret;
}
}
void main() {
auto start = Clock.currTime();
int r = fiba(1000_000_000);
auto elapsed = Clock.currTime() - start;
writeln(r);
writeln(elapsed);
}
--jm
More information about the Digitalmars-d
mailing list