Counting bits in a ulong

H. S. Teoh hsteoh at quickfur.ath.cx
Thu Jun 4 13:40:25 UTC 2020


On Thu, Jun 04, 2020 at 01:27:39AM +0000, Paul Backus via Digitalmars-d wrote:
[...]
> Interesting stuff. Did you compare your implementations with the
> popcnt function in core.bitop? There could be a potential improvement
> here.

On Thu, Jun 04, 2020 at 07:42:07AM +0000, Patrick Schluter via Digitalmars-d wrote:
[...]
> You should try to use the popcnt instruction of the cpu (in asm or
> with intrinsics). It's at least 3 or 4 order of magnitude (i.e.
> between 1000 and 10000 times) faster than any of these tricks. I made
> the change last year in my work codebase since we've left the SPARC
> servers (they had a very slow implementation of popcount).
> All modern CPU's have the instruction.

Haha, I learn something new every day. Didn't even know there was such a
CPU instruction as popcnt. :-)

But what's more interesting is the result of using core.bitop.popcnt: as
it turns out, it runs faster than Kernighan when the input ulongs have
high bit count, but *slower* than Kernighan when the inputs have low bit
density. This was very puzzling to me, since a CPU instruction is
supposed to be orders of magnitude faster, as Patrick said, so I looked
at the disassembly.

As it turns out, my CPU (AMD Phenom II X6 1055T) doesn't support the
instruction. :-(  So the LDC intrinsic reverted to a software
implementation, which is essentially the parallel bitcount I wrote,
except better. Evidently, I miscalculated the impact of data dependency
between instructions, or more likely, the AMD CPU doesn't have
hyperthreading so splitting the ulong into upper/lower halves did more
harm than good for the calculation.

But nonetheless, even the better implementation of parallel bitcount
loses out to Kernighan when the input does not have a high bit density.

I do have a VPS that runs on an Intel CPU, but apparently also an older
one that doesn't support SSE4 (either that, or the hypervisor filtered
that out for whatever reason, but I doubt this).

So again we see the complex landscape of optimization: not only the
optimality depend on input data, it also depends on the hardware. :-)


T

-- 
Век живи - век учись. А дураком помрёшь.


More information about the Digitalmars-d mailing list