inline asm in inlined function / ECX clobbered / stack frame / naked

Mon May 6 03:09:38 UTC 2019

I am posting here instead of learn because it is my understanding that 
DMD will not inline a fn that has inline asm, and because one of my 
questions relates to why LLVM is unsafely(?) using ECX.

Problem Statement: I am writing a faster round-up-to-nearest-power-of-2 
function. An inline asm [1] version using the BSR instruction is about 
twice as fast as the shift-and-OR sequence (--(x), (x)|=(x)>>1, 
(x)|=(x)>>2, (x)|=(x)>>4, (x)|=(x)>>8, (x)|=(x)>>16, ++(x)).[2]

I know about core.bitop.bsr and std.math.nextPow2 which uses it. My asm 
code block is 2.5x faster than codegen for (2 << bsr(x)) which surprises 
me...
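
For reference, a minimal sketch of the two non-asm versions I am 
comparing against. The function names roundupShift and roundupBsr are 
only for illustration, and the bsr variant mirrors my asm block 
(2 << msb(x - 1)) rather than the exact body of std.math.nextPow2.

```d
import core.bitop : bsr;

// Shift-and-OR version: smear the top set bit of (x - 1) into every
// lower bit, then add 1 to reach the next power of two.
uint roundupShift(uint x)
{
    --x;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x + 1;
}

// Intrinsic version: same result computed as 2 << msb(x - 1).
// The early return avoids calling bsr on 0, which is undefined.
uint roundupBsr(uint x)
{
    if (x <= 2) return x;
    return 2u << bsr(x - 1);
}
```

Both leave exact powers of two unchanged (8 stays 8) and round 
everything else up (5 becomes 8).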

Summary: I've inlined a function consisting mostly of inline assembly. 
In the initial approach my asm clobbers ECX, but LDC/LLVM does not 
recognize this and keeps using ECX as the loop counter. In approach 2 I 
push RCX to the stack; LDC/LLVM sees me use RCX(?) and uses EDX instead, 
but now the inline asm reference to the stack frame variable (x[EBP] or 
[EBP + x] -- side note, should that be RBP?) is off by 8 bytes. I can 
rescue that by manually writing [EBP + x + 8], which seems hinky and 
wrong. Moving the PUSH to after EAX is loaded from the stack puts 
everything in order.

Questions at bottom.

All results below are from ldc2 -release -O2 (version 1.15.0).

Link to compiler explorer: https://godbolt.org/z/uLaIS5

Approach 1:
uint roundup32(uint x)
{
     pragma(inline, true);
     version(LDC) pragma(LDC_allow_inline);

     if (x <= 2) return x;
     asm
     {
          mov EAX, x[EBP];
          sub EAX, 1;
          bsr ECX, EAX;   // ecx = y = msb(x-1)
          mov EAX, 2;
          shl EAX, CL;    // return (2 << y)
     }
}    // returns EAX

Result 1:
This works well _WHEN THE FUNCTION IS NOT INLINED_.

When inlined, my code clobbers ECX, which the calling function was 
using as its loop counter.

.LBB1_1:
         mov     dword ptr [rsp + 8], ecx	; x param == loop iter
         mov     eax, dword ptr [rsp + 8]	; x param
         sub     eax, 1
         bsr     ecx, eax
         mov     eax, 2
         shl     eax, cl			; inline asm block ends here
         mov     eax, eax		; alignment?
         add     rbx, rax		; rbx == accumulator
         add     ecx, 1		; next round for loop, but ecx was clobbered
         cmp     ecx, (# of loop iterations)
         jne     .LBB1_1
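
For context, the loop in the listing above corresponds roughly to a 
caller like the following. This is a hypothetical reconstruction from 
the register comments, not the exact code behind the Compiler Explorer 
link: sumRoundups is an illustrative name, and roundup32 here is a 
bsr-based stand-in so the sketch builds without inline asm.

```d
import core.bitop : bsr;

// Stand-in for the asm version: 2 << msb(x - 1), with small inputs
// returned unchanged.
uint roundup32(uint x)
{
    if (x <= 2) return x;
    return 2u << bsr(x - 1);
}

// Hypothetical caller: the loop counter is passed as the argument and
// the 32-bit results are accumulated into a 64-bit sum.
ulong sumRoundups(uint iterations)
{
    ulong acc = 0;                      // rbx in the listing
    foreach (uint i; 0 .. iterations)   // ecx in the listing
        acc += roundup32(i);
    return acc;
}
```

With the asm version inlined into such a loop, the counter and the 
BSR result end up competing for ECX.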

So, perhaps I can just save ECX.

Approach 2:
Same, but with PUSH RCX / POP RCX at the beginning and end of the asm 
block.

Result 2:
The compiler detects that I have used RCX and uses EDX as the loop 
counter instead.
New problem: since I have pushed onto the stack, the param x is now at 
[rsp + 16], but the generated code still believes it is at [rsp + 8]:

.LBB1_1:
         mov     dword ptr [rsp + 8], edx	; x param == loop iter
         push    rcx				; approach 2
         mov     eax, dword ptr [rsp + 8]	; looking for x here :-(
         sub     eax, 1
         bsr     ecx, eax
         mov     eax, 2
         shl     eax, cl
         pop     rcx			; inline asm block ends here
         mov     eax, eax		; alignment?
         add     rbx, rax		; rbx == accumulator
         add     edx, 1		; next round for loop
         cmp     edx, (# of loop iterations)
         jne     .LBB1_1

So now because we are looking in the wrong spot on the stack, the 
function has the wrong value in eax.

Approach 3:
Cheat, because I know RCX was pushed, and change the local reference to 
x from [RBP + x] to [RBP + x + 8].

Result 3:
This works, but feels wrong.

Final, successful approach:
I can just move the PUSH RCX to after EAX has been loaded from 
[rsp + 8] and not worry about the altered stack frame.

Questions:
0. Should I quit using DMD style assembly and use LLVM style?

1. gcc inline asm has the capability to specify in/out values and what's 
clobbered. I assume this is a deficiency of DMD style inline asm?

2. LDC/LLVM can inline this function, which eliminates the 
prologue/epilogue, but the inlined code still has a local stack slot 
holding the "passed" value x. What is going on? Is this typical 
behavior when a function is inlined?

3. Even though it is essentially a naked function, I cannot add naked, 
because then I can only access variables by name from the global scope. 
Why is that, given that once the function is inlined I can still access 
the "passed" x?
3b. Is an inlined function de facto "naked", meaning there is no need 
for this keyword?

4a. How can I tell LDC/LLVM that ECX is being used? Or: why does it not 
notice in approach 1, yet appear to notice in approach 2?
4b. Cdecl says ECX is a caller-saved register [3]; does this go out the 
window when the function is inlined? In that case, how can I be sure it 
is safe to use EAX, ECX, or EDX in an inline asm block in a function 
that will be inlined?
(Interestingly, when the function is not inlined, EBX is used as the 
loop counter, since the compiler knows ECX could be clobbered.)

5. Is the final solution of moving the PUSH RCX to after EAX is loaded 
from the stack the correct approach?

This was quite lengthy. As a novice, **I am very grateful for your 
expert help.**

James

References:
[1] DMD style assembly: https://dlang.org/spec/iasm.html

[2] You can read more about round up optimization here: 
http://locklessinc.com/articles/next_pow2/ (although note that in this 
very old article, the bitshift trick was the fastest whereas on modern 
processors the BSR instruction method is faster)

[3] https://www.agner.org/optimize/calling_conventions.pdf