inline asm in inlined function / ECX clobbered / stack frame / naked

James Blachly james.blachly at gmail.com
Tue May 7 01:20:43 UTC 2019


On 5/6/19 3:55 PM, kinke wrote:
> Adding `-mcpu=haswell` to the LDC command-line, the generated asm is:
> 
>      addl    $-1, %edi
>      lzcntl    %edi, %eax
>      xorb    $31, %al
>      movl    $2, %ecx
>      shlxl    %eax, %ecx, %eax
> 
> [Add `-mcpu=native` to tune code-gen for *your* CPU.]

Thanks kinke. Agree, I am all for portability. This is my first stab at asm.

The speed is not really the point; this is really me asking for help 
understanding compiler internals and fundamentals of inline assembly.

Nevertheless, running each of 3 algorithms (asm, famous bitwise OR ops 
for this problem, and core.bitop.bsr) 2^28 times I get the following results:

assembly version:
Elapsed msec: 564
Sum: 48038395756849835
Stopwatch is reset: 0
kroundup32:
Elapsed msec: 776
Sum: 48038395756849835
Stopwatch is reset: 0
bitop_bsr:
Elapsed msec: 1299
Sum: 48038395756849835
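
For readers who have not seen the two portable variants, here is a minimal C sketch of what they compute. It is not the D code that was timed. The function names are hypothetical, the OR-smearing version is written from the usual form of the kroundup32 macro, and `__builtin_clz` (a GCC/Clang builtin) stands in for bsr, so treat all of that as assumptions.

```c
#include <stdint.h>

/* Bitwise-OR variant: smear the highest set bit of (x - 1) into every
 * lower position, then add 1 to land on the next power of two.
 * Modeled on the usual kroundup32 macro (assumption). */
static inline uint32_t roundup32_or(uint32_t x)
{
    --x;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return ++x;
}

/* Bit-scan variant: 2 << (index of the highest set bit of x - 1).
 * 31 - clz(v) is that index for a nonzero 32-bit v. The x <= 2 guard
 * mirrors the compare at the top of the listings and keeps the scan
 * away from a zero input. */
static inline uint32_t roundup32_clz(uint32_t x)
{
    if (x <= 2)
        return x;
    return 2u << (31 - __builtin_clz(x - 1));
}
```

Both return x unchanged when it is already a power of two, and both wrap to 0 for inputs above 2^31, which is consistent with the three Sum lines agreeing.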

-mcpu=native:
The performance of bitop_bsr is markedly improved when using 
-mcpu=native on linux (but only marginally better on MacOS) which is 
really surprising(???) but still slower than the hand asm, about 2.25x 
the runtime from 2.5x on ivybridge (Mac) and 2.9x on sandybridge 
(linux). Maybe because I do not have the latest and greatest processors?

Assembly at bottom. The two xor instructions after bsr are 
unnecessary(?) and perhaps a contributor to slowdown.
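
One plausible reading of the xor pair, stated as an assumption and not checked against the compiler source: for an index i in 0..31, i ^ 31 equals 31 - i, so the first xor turns the bsr result into a leading-zero count and the second turns it back into a bit index. A small C sketch of that identity, with hypothetical helper names:

```c
#include <stdint.h>

/* Index of the highest set bit of a nonzero v, i.e. what BSR yields. */
static inline uint32_t highest_bit_index(uint32_t v)
{
    uint32_t idx = 0;
    while (v >>= 1)
        idx++;
    return idx;
}

/* Leading-zero count derived from the bit index: idx ^ 31 == 31 - idx
 * because idx fits in the low five bits. */
static inline uint32_t leading_zeros(uint32_t v)
{
    return highest_bit_index(v) ^ 31u;
}

/* The second xor undoes the first and recovers the bit index. */
static inline uint32_t index_from_leading_zeros(uint32_t lz)
{
    return lz ^ 31u;
}
```

If that reading is right, the two instructions cancel and only cost a couple of dependent ALU ops on the way to the shift.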

Anyway, mostly I wanted help understanding register safety and proper 
use of inline asm.
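
To make the register-safety question concrete, here is a hedged C analogue using GCC/Clang extended asm, where operands and clobbers are declared to the compiler instead of saving ECX by hand with push/pop. This is a sketch under assumptions, not the D asm {} block from this thread: `roundup32_asm` is a hypothetical name, and the fallback loop exists only so the snippet builds on non-x86 targets.

```c
#include <stdint.h>

static inline uint32_t roundup32_asm(uint32_t x)
{
    if (x <= 2)
        return x;               /* also keeps BSR away from a zero source */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    uint32_t idx;
    /* "=r"(idx): the compiler picks the destination register.
     * "r"(x - 1): the compiler picks the source register.
     * "cc": BSR modifies the flags.
     * No register is hard-coded, so nothing needs push/pop. */
    __asm__("bsrl %1, %0" : "=r"(idx) : "r"(x - 1) : "cc");
    return 2u << idx;           /* the shift stays in C for the optimizer */
#else
    /* Portable fallback: find the highest set bit of x - 1 with a loop. */
    uint32_t v = x - 1, idx = 0;
    while (v >>= 1)
        idx++;
    return 2u << idx;
#endif
}
```

The design point is that the register allocator can only work around registers it has been told about. Anything written inside the asm that is neither an output nor listed as a clobber is invisible to it, and that matters most once the function is inlined into a caller with live values in those registers.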

Of note, looking at the .s output, on MacOS the inline asm {} block is 
flagged in the assembly source file with ## InlineAsm Start ... 
## InlineAsm End, whereas under linux, it is flagged with #APP ... 
#NO_APP. Since this is compiler-emitted I would have expected it to be 
the same across platforms -- or is this determined by LLVM version?

James


Assembly with -O2 -release -mcpu=native

Note that these are the non-inlined versions.
Also not sure why roundup32(uint x) has so much more prologue than 
bitop_bsr(uint x).


_D9asminline9bitop_bsrFkZk:
         .cfi_startproc
         cmpl    $2, %edi
         ja      .LBB0_2
         movl    %edi, %eax
         retq
.LBB0_2:
         addl    $-1, %edi
         bsrl    %edi, %ecx
         xorl    $31, %ecx
         xorb    $31, %cl
         movl    $2, %eax
         shll    %cl, %eax
         retq


---

_D9asminline9roundup32FkZk:
         .cfi_startproc
         pushq   %rbp
         .cfi_def_cfa_offset 16
         .cfi_offset %rbp, -16
         movq    %rsp, %rbp
         .cfi_def_cfa_register %rbp
         movl    %edi, -4(%rbp)
         cmpl    $2, %edi
         ja      .LBB1_2
         movl    %edi, %eax
         popq    %rbp
         .cfi_def_cfa %rsp, 8
         retq
.LBB1_2:
         .cfi_def_cfa %rbp, 16
         #APP
         movl    -4(%rbp), %eax
         subl    $1, %eax
         pushq   %rcx
         bsrl    %eax, %ecx
         movl    $2, %eax
         shll    %cl, %eax
         popq    %rcx
         #NO_APP
         popq    %rbp
         .cfi_def_cfa %rsp, 8
         retq


