Simple features that I've always missed from C...
Don
nospam at nospam.com
Thu Oct 20 02:18:00 PDT 2011
On 19.10.2011 10:13, Manu wrote:
> Nicely spotted, I didn't realise the intel/amd distinction ;)
>
> Unless I'm mistaken, it is possible for D to return 'out' parameters by
> value right? (in additional return registers, no touching the stack?) ..
> Assuming that's the case you would surely standardise something more
> like the win32 intrinsic rather than one resembling the PPC opcode.
> If the function returns a bool that the value was zero or not, then I
> think it's fair to say the position is undefined (which supports the
> intel assertion).
>
> PPC's approach is more cleanly factored into the win32 model than the
> other way around I think, in terms of allowing the optimiser to trim the
> unused code. If the intrinsic generates implicit code to produce a bool
> from the value, it will surely be trimmed by the optimiser if that
> result is not used.
>
> While cmov might work nicely (although I really don't trust that opcode
> anyway, an intrinsic like bsr shouldn't be producing a hidden branch) on
> x86 to produce the PPC result, I'm not sure other architectures would
> have such a simple solution.
Most other architectures that I know of use a leading-zero count instead.
AMD64 (not Intel) has an LZCNT instruction and ARM has a CLZ instruction,
which gives:
lzcnt(x) = x? 63-bsr(x) : 64;
Here's how it could be done:
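RAX lzcnt(RBX)
{
    mov RCX, 127;     // 127 ^ 63 == 64, the result wanted for a zero input
    bsr RAX, RBX;     // sets ZF if RBX == 0; RAX is then undefined
    cmovz RAX, RCX;   // cmov takes no immediate operand, so go via a register
    xor RAX, 63;      // 63 - bsr(x) for nonzero x
}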
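To check the arithmetic behind the bsr / conditional move / xor sequence, here is a rough C model. The names bsr64 and lzcnt64 are made up for this sketch, and the loop only stands in for the real instruction. It relies on b ^ 63 == 63 - b for 0 <= b <= 63, and substitutes 127 in the zero case because 127 ^ 63 == 64.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference model of bsr: index of the highest set bit.
   Precondition: x != 0, since the hardware result is undefined for 0. */
unsigned bsr64(uint64_t x)
{
    unsigned pos = 0;
    assert(x != 0);
    while (x >>= 1)
        ++pos;
    return pos;
}

/* lzcnt(x) = x ? 63 - bsr(x) : 64, computed the way the asm sequence does. */
unsigned lzcnt64(uint64_t x)
{
    unsigned r = x ? bsr64(x) : 127u;  /* plays the role of the conditional move */
    return r ^ 63u;                    /* r ^ 63 == 63 - r for r in 0..63; 127 ^ 63 == 64 */
}
```

This is only a model of the identity, not a claim about how any compiler or library implements it.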
> Again, I think the win32 approach is easier
> for all architectures to produce and for the optimiser to truncate if
> the calculated result is unused.
>
> bool bsf/bsr(int value, out int position); // this assumes that position
> will cleanly return in a second return register...
Seems to be equivalent to replacing the bsr with a comma expression:
(position = native_bsr(value), value == 0)
Do we really gain much by this?
The more painful signature of the function somewhat discourages users
from calling it with a zero value, but the undefined position is still
exposed. So the original problem of undefined behaviour remains.
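For concreteness, the comma expression can be written out as a small C wrapper. This is only a sketch: native_bsr and bsr_is_zero are hypothetical names, and the loop is a stand-in for the raw instruction that happens to return 0 for a zero input where the hardware result would be undefined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the raw instruction. For value == 0 this
   returns 0, which merely represents "undefined" in this sketch. */
unsigned native_bsr(uint32_t value)
{
    unsigned pos = 0;
    while (value >>= 1)
        ++pos;
    return pos;
}

/* Mirrors (position = native_bsr(value), value == 0):
   returns true when value is zero; *position is meaningful only
   when the function returns false. */
bool bsr_is_zero(uint32_t value, unsigned *position)
{
    assert(position != NULL);
    *position = native_bsr(value);
    return value == 0;
}
```

The caller still receives a position when the result is true, so nothing in the signature stops that value from being used. The polarity follows the comma expression; flip it if true should mean "a bit was found".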
There's maybe a performance improvement in the fairly rare case where
there's a branch on zero value. Although theoretically, in existing code
the optimizer could check for the sequence:
    bsr dest, src
    cmp src, 0        // where only the Z flag is required
and remove the cmp, so I don't think the performance aspect should be
rated very highly.
It's a bit of a problem that AMD's bsf and bsr are so slow. They're
really slow on Pentium 4 and Atom as well.
Interestingly AMD's lzcnt is faster than their bsr. But since Intel
doesn't support it, it's pretty useless outside of inline asm.
I think we need to do a survey of as many architectures as possible,
before we can decide what to do. As far as I know, bsr/bsf is unique to
x86. If this is true, then bsf/bsr should probably be wrapped in
version(x86), and discouraged from general use. A portable function
(perhaps leadz, trailz) would need to be provided as well, and recommended
for general use.
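As a rough idea of what such portable functions might look like, here is a sketch in C for 32-bit values. The names leadz32 and trailz32 and the choice to return 32 for a zero input are assumptions made for illustration, not an agreed design. A real implementation would presumably map to lzcnt, clz or bsr/bsf where available rather than loop.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable leadz: number of leading zero bits.
   Defined for every input; returns 32 when x == 0. */
unsigned leadz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        ++n;
    }
    assert(n < 32);  /* a nonzero input has at most 31 leading zeros */
    return n;
}

/* Hypothetical portable trailz: number of trailing zero bits.
   Defined for every input; returns 32 when x == 0. */
unsigned trailz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++n;
    }
    assert(n < 32);  /* a nonzero input has at most 31 trailing zeros */
    return n;
}
```

With a defined result for zero, callers need no separate zero test, and for nonzero x the bsr value is recovered as 31 - leadz32(x).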