inline asm in inlined function / ECX clobbered / stack frame / naked
James Blachly
james.blachly at gmail.com
Mon May 6 03:09:38 UTC 2019
I am posting here instead of learn because it is my understanding that
DMD will not inline a fn that has inline asm, and because one of my
questions relates to why LLVM is unsafely(?) using ECX.
Problem Statement: I am writing a faster round-up-to-nearest-power-of-2
function. An inline asm [1] version using BSR instruction is about twice
as fast as (--(x), (x)|=(x)>>1, (x)|=(x)>>2, (x)|=(x)>>4, (x)|=(x)>>8,
(x)|=(x)>>16, ++(x)).[2]
I know about core.bitop.bsr and std.math.nextPow2 which uses it. My asm
code block is 2.5x faster than codegen for (2 << bsr(x)) which surprises
me...
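For reference, the two non-asm variants mentioned above can be sketched in plain D as follows. This is a minimal sketch, not the benchmarked code: the names roundupShift and roundupBsr are made up for illustration, and the early return for x <= 2 mirrors the guard used in the asm version later in this post.

```d
import core.bitop : bsr;

// Shift cascade: smear the highest set bit of (x - 1) downward, then add 1.
uint roundupShift(uint x)
{
    --x;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x + 1;
}

// bsr variant: 2 << (index of the highest set bit of x - 1).
uint roundupBsr(uint x)
{
    if (x <= 2) return x;      // bsr(0) is undefined, so handle small x first
    return 2u << bsr(x - 1);
}

void main()
{
    // Both variants should agree on every input in this range.
    foreach (uint x; 0 .. 70_000)
        assert(roundupShift(x) == roundupBsr(x));
}
```

Both round an exact power of two to itself, which is the behaviour the asm block below is meant to reproduce.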
Summary: I've inlined a function consisting mostly of inline assembly.
In the initial approach my asm clobbers ECX, but LDC/LLVM does not
recognize this and uses ECX as its loop counter. In approach 2, I push
RCX to the stack; LDC/LLVM sees me use RCX(?) and uses EDX instead, but
now the inline asm referencing the stack frame variable (x[EBP] or
[EBP + x] -- side note, should that be RBP?) is off by 8 bytes. I can
rescue this by manually writing [EBP + x + 8], which seems hinky and
wrong. By moving the PUSH to after EAX is loaded from the stack, all is
in order.
Questions at bottom.
Results below manifest with ldc2 -release -O2 (version 1.15.0)
Link to compiler explorer: https://godbolt.org/z/uLaIS5
Approach 1:
uint roundup32(uint x)
{
    pragma(inline, true);
    version(LDC) pragma(LDC_allow_inline);
    if (x <= 2) return x;
    asm
    {
        mov EAX, x[EBP];
        sub EAX, 1;
        bsr ECX, EAX;   // ecx = y = msb(x-1)
        mov EAX, 2;
        shl EAX, CL;    // return (2 << y)
    }
} // returns EAX
Result 1:
This works well _WHEN FN NOT INLINED_.
When inlined, my code clobbers ECX, which the calling function was using
to track loop iteration.
.LBB1_1:
    mov dword ptr [rsp + 8], ecx   ; x param == loop iter
    mov eax, dword ptr [rsp + 8]   ; x param
    sub eax, 1
    bsr ecx, eax
    mov eax, 2
    shl eax, cl                    ; inline asm block ends here
    mov eax, eax                   ; alignment?
    add rbx, rax                   ; rbx == accumulator
    add ecx, 1                     ; next round for loop, but ecx was chgd
    cmp ecx, (# of loop iterations)
    jne .LBB1_1
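The disassembly above is consistent with a caller of roughly the following shape: the loop counter is passed as x, and the results are summed into a 64-bit accumulator. This is a hypothetical reconstruction, since the post shows only the generated code. The name sumRoundups is invented, and roundup32 here is a core.bitop stand-in for the asm version so that the sketch runs on any compiler.

```d
import core.bitop : bsr;

// Stand-in for the asm version: same contract, no hand-picked registers.
uint roundup32(uint x)
{
    if (x <= 2) return x;
    return 2u << bsr(x - 1);
}

// Hypothetical caller: the loop counter feeds x, results go into a 64-bit sum.
ulong sumRoundups(uint n)
{
    ulong acc = 0;
    foreach (uint i; 0 .. n)
        acc += roundup32(i);
    return acc;
}

void main()
{
    // 0+1+2 + 4+4 + 4*8 + 7*16 = 155 for i in 0 .. 16
    assert(sumRoundups(16) == 155);
}
```

A fixed-sum check like this is one way to notice the clobbered counter: if the inlined asm overwrites the register holding i, the loop either produces a different sum or does not terminate after n iterations.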
So, perhaps I can just save ECX.
Approach 2:
Same, but PUSH RCX / POP RCX at beginning and end of the asm block.
Result 2:
The compiler detects that I have used RCX and instead uses EDX as the
loop counter. New problem: since I have pushed onto the stack, the param
x is now at offset rsp + 16, but the generated code still believes it is
at rsp + 8:
.LBB1_1:
    mov dword ptr [rsp + 8], edx   ; x param == loop iter
    push rcx                       ; approach 2
    mov eax, dword ptr [rsp + 8]   ; looking for x here :-(
    sub eax, 1
    bsr ecx, eax
    mov eax, 2
    shl eax, cl
    pop rcx                        ; inline asm block ends here
    mov eax, eax                   ; alignment?
    add rbx, rax                   ; rbx == accumulator
    add edx, 1                     ; next round for loop
    cmp edx, (# of loop iterations)
    jne .LBB1_1
So now because we are looking in the wrong spot on the stack, the
function has the wrong value in eax.
Approach 3:
Cheat, because I know RCX was pushed, and change the local reference to
x from [RBP + x] to [RBP + x + 8].
Result 3:
This works, but feels wrong.
Final, successful approach:
I can just move the PUSH RCX to after EAX has been loaded from
[rsp + 8] and not worry about the altered stack frame.
Questions:
0. Should I quit using DMD style assembly and use LLVM style?
1. gcc inline asm has the capability to specify in/out values and what's
clobbered. I assume this is a deficiency of DMD style inline asm?
2. LDC/LLVM can inline this function, which eliminates the
prolog/epilogue, but there is still a local stack frame consisting of
the "passed" value x. What is going on? This must be typical behavior
when a function is inlined, yes?
3. Even though it is essentially a naked function, I cannot add naked;
because then I can only access variables by name from the global scope.
Why is that, given that once the function is inlined I can still access
the "passed" x?
3b. Is an inlined function de facto "naked" meaning there is no need for
this keyword?
4a. How can I notify LDC/LLVM that ECX is being used OR: why does it not
notice in approach 1, but it does appear to in approach 2?
4b. Cdecl says ECX is a caller-saved register [3]; does this go out the
window when the function is inlined? In that case, how can I be sure it
is safe to use EAX, ECX, or EDX in an inline asm block in a function
that will be inlined?
(Interestingly when function not inlined, EBX is used to track loop iter
since it knows ECX could be clobbered)
5. Is the final solution of moving the PUSH RCX to after EAX is loaded
from the stack the correct approach?
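To make questions 0, 1 and 4a concrete: LDC's LLVM-style inline asm (the __asm template in ldc.llvmasm) takes a constraint string naming outputs, inputs and clobbers, much like gcc's extended asm. Below is a minimal sketch of what the BSR step might look like in that style. It assumes LDC on x86-64 and has not been checked against the codegen discussed above. The helper name msb is made up, and other compilers or targets fall back to core.bitop.bsr.

```d
import core.bitop : bsr;

// Index of the highest set bit of v (v must be nonzero).
uint msb(uint v)
{
    version (LDC)
    {
        version (X86_64)
        {
            import ldc.llvmasm;
            // AT&T operand order: source ($1) first, destination ($0) second.
            // "=r"       output in any register the compiler chooses
            // "r"        input in any register the compiler chooses
            // "~{flags}" the instruction clobbers the flags register
            return __asm!uint("bsrl $1, $0", "=r,r,~{flags}", v);
        }
        else
            return cast(uint) bsr(v);
    }
    else
        return cast(uint) bsr(v);   // portable fallback for other compilers
}

uint roundup32(uint x)
{
    if (x <= 2) return x;
    return 2u << msb(x - 1);
}

void main()
{
    assert(roundup32(3) == 4);
    assert(roundup32(1000) == 1024);
}
```

Because the registers are chosen by the compiler rather than hard-coded, nothing in this form depends on ECX being free, and the shift is left to ordinary D code. Whether it matches the speed of the hand-written block is a separate question.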
This was quite lengthy; As a novice **I am very grateful for your expert
help.**
James
References:
[1] DMD style assembly: https://dlang.org/spec/iasm.html
[2] You can read more about round up optimization here:
http://locklessinc.com/articles/next_pow2/ (although note that in this
very old article, the bitshift trick was the fastest whereas on modern
processors the BSR instruction method is faster)
[3] https://www.agner.org/optimize/calling_conventions.pdf