inline asm in inlined function / ECX clobbered / stack frame / naked
kinke
noone at nowhere.com
Wed May 8 14:37:29 UTC 2019
On Tuesday, 7 May 2019 at 01:20:43 UTC, James Blachly wrote:
> The performance of bitop_bsr is markedly improved when using
> -mcpu=native on linux (but only marginally better on MacOS)
> which is really surprising(???) but still slower than the hand
> asm, about 2.25x the runtime from 2.5x on ivybridge (Mac) and
> 2.9x on sandybridge (linux).
I still couldn't believe this, so I benchmarked it myself.
First observation: you didn't re-start() the StopWatch after
resetting it; the next stop() will then measure the elapsed time
since it was originally started (!).
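For illustration, the intended pattern looks like this (a minimal
sketch with placeholder loop bodies):
```
import std.datetime.stopwatch : AutoStart, StopWatch;

void timedRuns()
{
    auto sw = StopWatch(AutoStart.yes);
    // ... first measured loop ...
    sw.stop();
    const first = sw.peek;

    sw.reset();
    sw.start(); // without this, the watch stays stopped, and the next
                // stop() adds the time elapsed since the original start
    // ... second measured loop ...
    sw.stop();
    const second = sw.peek;
}
```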
I modified the code a bit, switching to LLVM inline asm (highly
recommended if DMD compatibility isn't that important) and adding
a nextPow2-based version:
```
enum MAXITER = 1 << 28;

pragma(inline, true):

// BSR-based roundup; the `#`-commented lines are the 2 extra XORs that
// the bsr()-based codegen executes (see observation 1 below).
uint roundup32(uint x)
{
    if (x <= 2) return x;
    import ldc.llvmasm;
    return __asm!uint(
        `bsrl %eax, %ecx
        #xorl $$31, %ecx
        #xorb $$31, %cl
        movl $$2, %eax
        shll %cl, %eax`,
        "={eax},{eax},~{ecx}", x - 1);
}

uint bitop_bsr(uint x)
{
    import core.bitop;
    return x <= 2 ? x : 2 << bsr(x - 1);
}

uint nextPow2(uint x)
{
    import std.math;
    return x <= 2 ? x : std.math.nextPow2(x - 1);
}

// Branchless bit-smearing: propagate the highest set bit of x-1 into
// all lower bits, then add 1 to get the next power of two.
uint kroundup32(uint x)
{
    x -= 1;
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    return x + 1;
}

void main()
{
    static void benchmark(alias Func)()
    {
        import std.datetime.stopwatch;
        import std.stdio;
        ulong sum;
        auto sw = StopWatch(AutoStart.yes);
        for (uint i; i < MAXITER; i++)
            sum += Func(i);
        sw.stop();
        writeln(__traits(identifier, Func), ":\t",
                sw.peek.total!"msecs", "ms\t", sum);
    }

    benchmark!roundup32();
    benchmark!bitop_bsr();
    benchmark!nextPow2();
    benchmark!kroundup32();
}
```
My results with an Intel Ivy Bridge CPU:
ldc2 -O -run asm.d:
roundup32: 254ms 48038395756849835
bitop_bsr: 386ms 48038395756849835
nextPow2: 381ms 48038395756849835
kroundup32: 326ms 48038395756849835
Observation 1: the inline asm version saving the 2 XORs is indeed
faster, but by about 50%, not more than twice as fast
(386ms / 254ms ≈ 1.5x). This was surprising to me at first, but
checking https://gmplib.org/~tege/x86-timing.pdf explains it: BSR
takes a surprisingly low 3 cycles on my Ivy Bridge, vs. 5 cycles with
the 2 additional dependent XORs, so roughly 5/3 ≈ 1.67x is the best
one could expect.
Observation 2: std.math.nextPow2() is fine (equivalent to
bitop_bsr).
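(As an aside, std.math.nextPow2 returns the power of two strictly
greater than its argument, which is why both versions above pass
`x - 1`; a quick sanity check:)
```
unittest
{
    import std.math : nextPow2;
    assert(nextPow2(7) == 8);     // strictly greater than the argument
    assert(nextPow2(8) == 16);    // even for an exact power of two
    assert(nextPow2(8 - 1) == 8); // so nextPow2(x - 1) rounds up to >= x
}
```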
Observation 3: kroundup32 is faster than nextPow2/bitop_bsr. With
`-mcpu=native`, the first 3 timings are unchanged in my case, but the
kroundup32 runtime shrinks to ~140ms. At that point it was clear that
auto-vectorization must be interfering (kroundup32's branchless
shift-and-OR body vectorizes trivially), and a look at the (inlined)
asm confirmed it. Adding `-disable-loop-vectorization` to the cmdline
makes the kroundup32 time exactly the same as nextPow2/bitop_bsr,
~380ms, independently of `-mcpu=native`.
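For reference, these are the kinds of invocations compared here:
```
ldc2 -O -mcpu=native -run asm.d
ldc2 -O -mcpu=native -disable-loop-vectorization -run asm.d
```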
The x86 timings PDF also shows that nextPow2 should be faster than
the inline asm version on an AMD Zen CPU with tuned codegen, as BSR
takes 4 cycles there, while LZCNT+XOR takes only 2. So whether inline
asm really pays off in this case is highly CPU- and codegen-specific.
It definitely kills auto-vectorization though, while nextPow2 may be
vectorizable on some non-x86 targets. If the real-world loop body is
vectorizable, kroundup32 may be significantly faster than the other
versions...
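To illustrate the tuned-codegen point, here's a sketch (my addition,
not benchmarked) of a variant based on LDC's llvm_ctlz intrinsic; on
LZCNT-capable targets, LLVM can select LZCNT directly instead of BSR:
```
uint roundup32_ctlz(uint x)
{
    import ldc.intrinsics : llvm_ctlz;
    if (x <= 2) return x;
    // bsr(v) == 31 - ctlz(v) for v != 0; the second argument tells
    // LLVM that v == 0 doesn't need to be handled.
    return 2u << (31 - llvm_ctlz(x - 1, true));
}
```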