inline asm in inlined function / ECX clobbered / stack frame / naked
kinke
noone at nowhere.com
Wed May 8 14:37:29 UTC 2019
On Tuesday, 7 May 2019 at 01:20:43 UTC, James Blachly wrote:
> The performance of bitop_bsr is markedly improved when using
> -mcpu=native on linux (but only marginally better on MacOS)
> which is really surprising(???) but still slower than the hand
> asm, about 2.25x the runtime from 2.5x on ivybridge (Mac) and
> 2.9x on sandybridge (linux).
I still couldn't believe this, so I benchmarked it myself.
First observation: you didn't re-start() the StopWatch after
resetting it; the next stop() will then measure the elapsed time
since it was originally started (!).
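For illustration, the intended pattern looks like this (a minimal
sketch with placeholder loop bodies):
```
import std.datetime.stopwatch : AutoStart, StopWatch;

void timedRuns()
{
    auto sw = StopWatch(AutoStart.yes);
    // ... first measured loop ...
    sw.stop();
    const first = sw.peek;

    sw.reset();
    sw.start(); // without this, the watch stays stopped, and the next
                // stop() adds the time elapsed since the original start
    // ... second measured loop ...
    sw.stop();
    const second = sw.peek;
}
```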
I modified the code a bit, switching to LLVM inline asm (highly
recommended if DMD compatibility isn't that important) and adding
a nextPow2-based version:
```
enum MAXITER = 1 << 28;

pragma(inline, true):

// BSR-based roundup; the `#`-commented lines are the 2 extra XORs that
// the bsr()-based codegen executes (see observation 1 below).
uint roundup32(uint x)
{
    if (x <= 2) return x;
    import ldc.llvmasm;
    return __asm!uint(
        `bsrl %eax, %ecx
        #xorl $$31, %ecx
        #xorb $$31, %cl
        movl $$2, %eax
        shll %cl, %eax`,
        "={eax},{eax},~{ecx}", x - 1);
}

uint bitop_bsr(uint x)
{
    import core.bitop;
    return x <= 2 ? x : 2 << bsr(x - 1);
}

uint nextPow2(uint x)
{
    import std.math;
    return x <= 2 ? x : std.math.nextPow2(x - 1);
}

// Branchless bit-smearing: propagate the highest set bit of x-1 into
// all lower bits, then add 1 to get the next power of two.
uint kroundup32(uint x)
{
    x -= 1;
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    return x + 1;
}

void main()
{
    static void benchmark(alias Func)()
    {
        import std.datetime.stopwatch;
        import std.stdio;
        ulong sum;
        auto sw = StopWatch(AutoStart.yes);
        for (uint i; i < MAXITER; i++)
            sum += Func(i);
        sw.stop();
        writeln(__traits(identifier, Func), ":\t",
                sw.peek.total!"msecs", "ms\t", sum);
    }

    benchmark!roundup32();
    benchmark!bitop_bsr();
    benchmark!nextPow2();
    benchmark!kroundup32();
}
```
My results with an Intel Ivy Bridge CPU:
ldc2 -O -run asm.d:
roundup32: 254ms 48038395756849835
bitop_bsr: 386ms 48038395756849835
nextPow2: 381ms 48038395756849835
kroundup32: 326ms 48038395756849835
Observation 1: the inline asm version saving the 2 XORs is indeed
faster, but by about 50%, not more than twice as fast
(386ms / 254ms ≈ 1.5x). This was surprising to me at first, but
checking https://gmplib.org/~tege/x86-timing.pdf explains it: BSR
takes a surprisingly low 3 cycles on my Ivy Bridge, vs. 5 cycles with
the 2 additional dependent XORs, so roughly 5/3 ≈ 1.67x is the best
one could expect.
Observation 2: std.math.nextPow2() is fine (equivalent to
bitop_bsr).
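(As an aside, std.math.nextPow2 returns the power of two strictly
greater than its argument, which is why both versions above pass
`x - 1`; a quick sanity check:)
```
unittest
{
    import std.math : nextPow2;
    assert(nextPow2(7) == 8);     // strictly greater than the argument
    assert(nextPow2(8) == 16);    // even for an exact power of two
    assert(nextPow2(8 - 1) == 8); // so nextPow2(x - 1) rounds up to >= x
}
```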
Observation 3: kroundup32 is faster than nextPow2/bitop_bsr. With
`-mcpu=native`, the first 3 timings are unchanged in my case, but the
kroundup32 runtime shrinks to ~140ms. At that point it was clear that
auto-vectorization must be interfering (kroundup32's branchless
shift-and-OR body vectorizes trivially), and a look at the (inlined)
asm confirmed it. Adding `-disable-loop-vectorization` to the cmdline
makes the kroundup32 time exactly the same as nextPow2/bitop_bsr,
~380ms, independently of `-mcpu=native`.
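For reference, these are the kinds of invocations compared here:
```
ldc2 -O -mcpu=native -run asm.d
ldc2 -O -mcpu=native -disable-loop-vectorization -run asm.d
```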
The x86 timings PDF also shows that nextPow2 should be faster than
the inline asm version on an AMD Zen CPU with tuned codegen, as BSR
takes 4 cycles there, while LZCNT+XOR takes only 2. So whether inline
asm really pays off in this case is highly CPU- and codegen-specific.
It definitely kills auto-vectorization though, while nextPow2 may be
vectorizable on some non-x86 targets. If the real-world loop body is
vectorizable, kroundup32 may be significantly faster than the other
versions...
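To illustrate the tuned-codegen point, here's a sketch (my addition,
not benchmarked) of a variant based on LDC's llvm_ctlz intrinsic; on
LZCNT-capable targets, LLVM can select LZCNT directly instead of BSR:
```
uint roundup32_ctlz(uint x)
{
    import ldc.intrinsics : llvm_ctlz;
    if (x <= 2) return x;
    // bsr(v) == 31 - ctlz(v) for v != 0; the second argument tells
    // LLVM that v == 0 doesn't need to be handled.
    return 2u << (31 - llvm_ctlz(x - 1, true));
}
```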