Simple features that I've always missed from C...

Thu Oct 20 02:18:00 PDT 2011

On 19.10.2011 10:13, Manu wrote:
> Nicely spotted, I didn't realise the intel/amd distinction ;)
>
> Unless I'm mistaken, it is possible for D to return 'out' parameters by
> value right? (in additional return registers, no touching the stack?) ..
> Assuming that's the case you would surely standardise something more
> like the win32 intrinsic rather than one resembling the PPC opcode.
> If the function returns a bool that the value was zero or not, then I
> think it's fair to say the position is undefined (which supports the
> intel assertion).
 >
> PPC's approach is more cleanly factored into the win32 model than the
> other way around I think, in terms of allowing the optimiser to trim the
> unused code. If the intrinsic generates implicit code to produce a bool
> from the value, it will surely be trimmed by the optimiser if that
> result is not used.
>
> While cmov might work nicely (although I really don't trust that opcode
> anyway, an intrinsic like bsr shouldn't be producing a hidden branch) on
> x86 to produce the PPC result, I'm not sure other architectures would
> have such a simple solution.

Most other architectures that I know of, use lzcnt instead.

On AMD64 (not Intel) and on ARM, there's an LZCNT resp. CLZ instruction, 
which gives:
lzcnt(x) = x? 63-bsr(x) : 64;

Here's how it could be done:

RAX lzcnt(EBX)
{
   bsr RAX, RBX;
   cmovz RAX, -1
   xor RAX, 63;
}

> Again, I think the win32 approach is easier
> for all architectures to produce and for the optimiser to truncate if
> the calculated result is unused.
>
> bool bsf/bsr(int value, out int position); // this assumes that position
> will cleanly return in a second return register...

Seems to be equivalent to replacing the bsr with a comma expression:

(position = native_bsr(value), value == 0)

Do we really gain much by this?
The more painful signature of the function somewhat discourages users 
from calling it with a zero value, but the undefined position is still 
exposed. So the original problem of undefined behaviour remains.
There's maybe a performance improvement in the fairly rare case where 
there's a branch on zero value. Although theoretically, in existing code 
the optimizer could check for the sequence:
bsr dest, src
cmp src, 0   where only Z flag is required

and remove the cmp, so I don't think the performance aspect should be 
rated very highly.

It's a bit of a problem that AMD's bsf and bsr are so slow. They're 
really slow on Pentium 4 and Atom as well.

Interestingly AMD's lzcnt is faster than their bsr. But since Intel 
doesn't support it, it's pretty useless outside of inline asm.

I think we need to do a survey of as many architectures as possible, 
before we can decide what to do. As far as I know, bsr/bsf is unique to 
x86. If this is true, then bsf/bsr should probably be wrapped in 
version(x86), and discouraged from general use. A portable function 
(perhaps leadz, trailz) would need to provided as well, and recommended 
for general use.