core.bitop.bt not faster than & ?

H. S. Teoh via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Wed Dec 17 08:15:09 PST 2014


On Wed, Dec 17, 2014 at 04:08:40PM +0000, Trollgeir via Digitalmars-d-learn wrote:
> On Wednesday, 17 December 2014 at 14:58:13 UTC, Adam D. Ruppe wrote:
> >On Wednesday, 17 December 2014 at 14:12:16 UTC, Trollgeir wrote:
> >>I'd expect the bt function to be up to 32 times faster as I thought
> >>it only compared two bits, and not the entire length of bits in the
> >>uint.
> >
> >The processor doesn't work in terms of bits like that - it still
> >needs to look at the whole integer. In fact, according to my (OLD)
> >asm reference, the bt instruction is slower than the and instruction
> >at the cpu level.
> >
> >I think it has to do a wee bit more work, translating the 16 into a
> >mask then moving the result into the flag... then moving the flag
> >back into a register to return the value. (That last step could
> >probably be skipped if you do an if() on it and the compiler
> >optimizes the branch, and the first step might be skipped too if it
> >is a constant, since the compiler can rewrite the instruction. So
> >given that, I'd expect what you saw: no difference when they are
> >optimized to the same thing or when the CPU's stars align right, and
> >& a bit faster when bt isn't optimized)
> >
> >bt() and friends are special instructions for specialized use cases.
> >Probably useful for threading and stuff.
> 
> 
> Thanks for the explanation, I suspected it might work something like
> that.
> 
> For my implementation - I have bits shifting to the right every
> update, and I want to check if it has reached certain markers. Hence,
> I felt it was really inefficient to check every single bit in the uint
> when I was only interested in some specific ones. Is an alternative
> (optimized) version of bt even possible?

That's an inaccurate understanding of how the CPU works. It does not
check the bits "one by one"; the & operation translates to a single asm
instruction that performs the bitwise-and of *all* bits in parallel.
There is no faster implementation.
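For what it's worth, here is a quick sketch in D showing that the two spellings answer the same question (the bit index 5 and the sample value are arbitrary, just for illustration):

```d
import core.bitop : bt;

void main()
{
    uint flags = 0b1010_0000;        // bits 5 and 7 set

    // bt takes a pointer to a size_t array and a bit index,
    // returning nonzero if that bit is set.
    size_t[1] words = [flags];
    bool viaBt = bt(words.ptr, 5) != 0;

    // The & version builds the mask explicitly and tests all
    // 32 bits of the word in a single AND instruction.
    bool viaAnd = (flags & (1u << 5)) != 0;

    assert(viaBt && viaAnd);         // both report bit 5 set
    assert(bt(words.ptr, 0) == 0);   // bit 0 is clear either way
    assert((flags & 1u) == 0);
}
```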

What causes the slowdown is probably the conversion of the bit index
into a mask. You could alleviate this by precomputing the mask when
you're testing the same bit position in many values: you pay the cost of
building the mask once and reuse it on every test thereafter, instead of
recomputing it every single time.
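A rough sketch of that idea (the sample values and bit position below are made up for illustration):

```d
void main()
{
    // Testing the same bit position across many values:
    // compute the mask once, reuse it on every test.
    enum size_t bitIndex = 5;        // position of interest (arbitrary)
    enum uint mask = 1u << bitIndex; // hoisted out of the loop

    uint[] samples = [0b0010_0000, 0b0000_0001, 0b1110_0000, 0];

    size_t hits;
    foreach (v; samples)
        if (v & mask)                // one AND + test per value
            ++hits;

    assert(hits == 2);               // first and third samples have bit 5 set
}
```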


T

-- 
Computers shouldn't beep through the keyhole.
