inline asm in inlined function / ECX clobbered / stack frame / naked

Mon May 6 19:08:47 UTC 2019

On Monday, 6 May 2019 at 03:09:38 UTC, James Blachly wrote:
> I know about core.bitop.bsr and std.math.nextPow2 which uses 
> it. My asm code block is 2.5x faster than codegen for (2 << 
> bsr(x)) which surprises me...

Sorry, but I'll focus on just that and not on the asm questions. 
The reason is simple: I discourage anyone from going down to asm 
level if it can be avoided.

So, I have:

pragma(inline, true)
uint roundup32(uint x)
{
    import core.bitop;
    // Guard disabled here; without it x must be >= 2,
    // since bsr(0) is undefined.
    //if (x <= 2) return x;
    return 2u << bsr(x-1);
}

`ldc2 -mtriple=x86_64-linux-gnu -O -output-s foo.d` (AT&T 
syntax...):

_D3foo9roundup32FkZk:
	addl	$-1, %edi
	bsrl	%edi, %ecx
	xorl	$31, %ecx
	xorb	$31, %cl
	movl	$2, %eax
	shll	%cl, %eax
	retq

I can't believe that's 2.5x slower than your almost identical asm 
block. And that code is portable: not just OS- and 
ABI-independent, but architecture-independent too. 1000x better 
than inline asm IMO.
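
If you want to convince yourself that the bsr version does what you expect before timing it, here is a minimal sketch that checks it against a plain doubling loop. `roundupNaive` and the tested ranges are my own assumptions for illustration, not something from your code, and the inputs stay within 2 .. 2^31 because bsr(0) is undefined and larger results overflow a uint.

```d
import core.bitop : bsr;

// Same function as above, repeated so this compiles on its own.
pragma(inline, true)
uint roundup32(uint x)
{
    // bsr(0) is undefined, so this assumes x >= 2.
    return 2u << bsr(x - 1);
}

// Hypothetical reference: smallest power of two >= x, found by doubling.
uint roundupNaive(uint x)
{
    uint p = 1;
    while (p < x)
        p <<= 1;
    return p;
}

void main()
{
    // Compare both versions over a range of small inputs.
    foreach (uint x; 2 .. 100_000)
        assert(roundup32(x) == roundupNaive(x));

    // Edge cases near the top of the valid range.
    assert(roundup32((1u << 30) + 1) == 1u << 31);
    assert(roundup32(1u << 31) == 1u << 31);
}
```

Nothing here is x86-specific, so the same file should build unchanged for any target the compiler supports.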