inline asm in inlined function / ECX clobbered / stack frame / naked
James Blachly
james.blachly at gmail.com
Tue May 7 01:20:43 UTC 2019
On 5/6/19 3:55 PM, kinke wrote:
> Adding `-mcpu=haswell` to the LDC command-line, the generated asm is:
>
> addl $-1, %edi
> lzcntl %edi, %eax
> xorb $31, %al
> movl $2, %ecx
> shlxl %eax, %ecx, %eax
>
> [Add `-mcpu=native` to tune code-gen for *your* CPU.]
Thanks kinke. Agree, I am all for portability. This is my first stab at asm.
The speed is not really the point; this is really me asking for help
understanding compiler internals and fundamentals of inline assembly.
Nevertheless, running each of the 3 algorithms (asm, the famous bitwise OR
ops for this problem, and core.bitop.bsr) 2^28 times, I get the following
results:
assembly version:
Elapsed msec: 564
Sum: 48038395756849835
Stopwatch is reset: 0
kroundup32:
Elapsed msec: 776
Sum: 48038395756849835
Stopwatch is reset: 0
bitop_bsr:
Elapsed msec: 1299
Sum: 48038395756849835
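For readers without the D source at hand, here is a rough C sketch of the two non-asm variants being timed. It is an illustration, not the benchmark itself: kroundup32 follows the macro of that name from klib, roundup32_bsr mirrors what the bitop_bsr listing further down does (return x unchanged when x <= 2, otherwise 2 << bsr(x - 1)), and sum_roundups is a hypothetical stand-in for the timing loop. __builtin_clz is a GCC/Clang builtin, so this assumes one of those compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise-OR version, after the kroundup32 macro from klib:
   smear the highest set bit of (x - 1) downwards, then add 1. */
uint32_t kroundup32(uint32_t x) {
    --x;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return ++x;
}

/* Bit-scan version: 2 << (index of the highest set bit of x - 1).
   31 - clz(v) is that index for any non-zero v; x > 2 guarantees
   x - 1 is non-zero here. */
uint32_t roundup32_bsr(uint32_t x) {
    if (x <= 2) return x;
    unsigned hi = 31u - (unsigned)__builtin_clz(x - 1);
    return 2u << hi;
}

/* Hypothetical checksum loop: sums f(i) for i in [0, n). */
uint64_t sum_roundups(uint32_t (*f)(uint32_t), uint32_t n) {
    uint64_t sum = 0;
    for (uint32_t i = 0; i < n; ++i) sum += f(i);
    return sum;
}

/* Both variants should produce the same checksum over the same range. */
void check_agreement(void) {
    assert(sum_roundups(kroundup32, 1u << 16) ==
           sum_roundups(roundup32_bsr, 1u << 16));
}
```

Comparing the checksums, as the identical Sum lines above do, is a cheap way to confirm that the variants agree before comparing timings.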
-mcpu=native:
The performance of bitop_bsr is markedly improved with -mcpu=native on
Linux (but only marginally better on macOS), which is really
surprising(???). It is still slower than the hand-written asm: about 2.25x
the runtime, down from 2.5x on Ivy Bridge (Mac) and 2.9x on Sandy Bridge
(Linux). Maybe because I do not have the latest and greatest processors?
Assembly at bottom. The two xor instructions after bsr are
unnecessary(?) and perhaps a contributor to slowdown.
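A small C sketch of why the xorl $31 / xorb $31 pair cancels out. One plausible reading (an assumption on my part, not checked against the LDC or LLVM sources) is that bsr gets expressed as a count-leading-zeros, which is bsr ^ 31, and is then converted back to a bit index with 31 - clz, which is again an XOR with 31. The function names below are made up for illustration; __builtin_clz is a GCC/Clang builtin and is undefined for 0.

```c
#include <assert.h>
#include <stdint.h>

/* Highest-set-bit index computed as 31 - clz (requires x != 0). */
unsigned bsr_via_sub(uint32_t x) {
    return 31u - (unsigned)__builtin_clz(x);
}

/* Same index computed with XOR: for n in 0..31, 31 - n == n ^ 31. */
unsigned bsr_via_xor(uint32_t x) {
    return (unsigned)__builtin_clz(x) ^ 31u;
}

/* Checks both identities over every possible 5-bit bit index:
   subtracting from 31 equals XOR with 31, and XOR-ing twice
   gives back the original value. */
void check_xor_identities(void) {
    for (unsigned n = 0; n < 32; ++n) {
        assert((31u - n) == (n ^ 31u));
        assert(((n ^ 31u) ^ 31u) == n);
    }
}
```

If that reading is right, the two XORs are a no-op on the bsr result, and dropping both leaves the same shift count in %cl.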
Anyway, mostly I wanted help understanding register safety and proper
use of inline asm.
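To make the register-safety question concrete, here is how the same idea looks in C with GCC/Clang extended asm. This is only an analogy: the D asm {} syntax is different, and roundup32_asm is a name made up for this sketch. The point is that operands are passed through constraints, so the compiler picks the registers itself and nothing has to be pushed or popped by hand to protect ECX. The non-x86 branch is a plain loop so that the sketch still compiles elsewhere.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next power of two (x <= 2 is returned unchanged). */
uint32_t roundup32_asm(uint32_t x) {
    uint32_t hi;
    if (x <= 2) return x;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* "=r"(hi): output in any general register chosen by the compiler.
       "r"(x - 1): input in any general register.
       "cc": bsr modifies the flags. */
    __asm__("bsrl %1, %0" : "=r"(hi) : "r"(x - 1) : "cc");
#else
    /* Portable fallback: count how many times x - 1 can be halved
       before reaching zero, which is the index of its highest set bit. */
    hi = 0;
    for (uint32_t v = x - 1; v >>= 1; ) ++hi;
#endif
    return 2u << hi;
}

/* A few spot checks of the expected rounding behaviour. */
void check_roundup32_asm(void) {
    assert(roundup32_asm(3) == 4);
    assert(roundup32_asm(5) == 8);
    assert(roundup32_asm(1024) == 1024);
}
```

Because the compiler knows which registers the statement reads and writes, it can also keep x in a register instead of reloading it from the stack, which is what the movl -4(%rbp), %eax in the roundup32 listing below does.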
Of note, looking at the .s output, on macOS the inline asm {} block is
flagged in the assembly source file with ## InlineAsm Start ...
## InlineAsm End, whereas under Linux it is flagged with #APP ...
#NO_APP. Since this is compiler-emitted I would have expected it to be
the same across platforms -- or is this determined by LLVM version?
James
Assembly with -O2 -release -mcpu=native
Note that these are the non-inlined versions.
Also, I am not sure why roundup32(uint x) has so much more prologue than
bitop_bsr(uint x).
_D9asminline9bitop_bsrFkZk:
.cfi_startproc
cmpl $2, %edi
ja .LBB0_2
movl %edi, %eax
retq
.LBB0_2:
addl $-1, %edi
bsrl %edi, %ecx
xorl $31, %ecx
xorb $31, %cl
movl $2, %eax
shll %cl, %eax
retq
---
_D9asminline9roundup32FkZk:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
movl %edi, -4(%rbp)
cmpl $2, %edi
ja .LBB1_2
movl %edi, %eax
popq %rbp
.cfi_def_cfa %rsp, 8
retq
.LBB1_2:
.cfi_def_cfa %rbp, 16
#APP
movl -4(%rbp), %eax
subl $1, %eax
pushq %rcx
bsrl %eax, %ecx
movl $2, %eax
shll %cl, %eax
popq %rcx
#NO_APP
popq %rbp
.cfi_def_cfa %rsp, 8
retq