Counting bits in a ulong
H. S. Teoh
hsteoh at quickfur.ath.cx
Thu Jun 4 01:16:13 UTC 2020
So recently I needed to implement a function to count the number of 1
bits in a ulong inside a (very) hot inner loop. There's of course the
naïve method (`val` is the input ulong):
    uint count;
    foreach (i; 0 .. 64)
    {
        if (val & (1UL << i))
            count++;
    }
    return count;
Knowing that this code is inside a hotspot, I wrote a parallel bit
counter based on the idea from [1]:
    // Parallel bit count.
    // We count the upper/lower halves separately, with the two streams
    // interleaved to expose instruction-level parallelism. >:-)
    uint v1 = cast(uint)(val >> 32);
    uint v2 = cast(uint)val;
    v1 = v1 - ((v1 >> 1) & 0x5555_5555);
    v2 = v2 - ((v2 >> 1) & 0x5555_5555);
    v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
    v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
    v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
    v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
    v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
    v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
    v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
    v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);
    return v1 + v2;
[1] https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
The basic idea behind this trick is to add bits in pairs, then in 4's,
then in 8's, and so on, inside the ulong itself. I furthermore split it
into upper and lower halves and interleaved the two streams to cut the
data dependencies between instructions and (hopefully) give the CPU's
superscalar, out-of-order execution more independent work. This is
completely branchless and in theory should maximize throughput.
Unfortunately, it comes at the cost of a high instruction count.
The third alternative is known as Kernighan's trick, which involves a
loop that executes exactly as many times as the number of 1 bits:
    uint count;
    for (count = 0; val; count++)
    {
        val &= val - 1;
    }
    return count;
return count;
(How it does this is left as an exercise for the reader. It's really
quite clever.) This has the advantage of a very low instruction count,
but the loop's exit branch isn't easily predictable, so actual
execution time can be marginally slower than branchless code.
I did some measurements with real data (not just a micro benchmark, this
is the actual algorithm with the bitcount function in its hotspot), and
as expected the naïve bitcount performed the worst, about 2x slower.
However, between Kernighan and the parallel bit counter, the results
depend on the kind of input given. For program inputs where there is a
large number of 1's in the average ulong to be counted, the parallel bit
counter performs better. Here's one example of such an input case:
Naïve:
real 0m4.159s
user 0m4.172s
sys 0m0.024s
Kernighan:
real 0m2.114s
user 0m2.129s
sys 0m0.020s
Parallel:
real 0m2.024s
user 0m2.039s
sys 0m0.028s
As you can see, the parallel count is only slightly (albeit
consistently) faster than Kernighan's trick.
However, when the input tends to generate ulongs with a low saturation
of 1's, Kernighan wins out by quite a large margin:
Naïve:
real 0m5.661s
user 0m5.706s
sys 0m0.054s
Kernighan:
real 0m3.080s
user 0m3.123s
sys 0m0.058s
Parallel:
real 0m3.805s
user 0m3.844s
sys 0m0.056s
This illustrates how optimization is often input-dependent: what works
well for one kind of input may not work so well for another. Even an
overall better choice (in this case, Kernighan's trick) will perform
less well in some niche cases.
This goes to reinforce the point Andrei recently made, that sometimes if
you optimize for a micro-benchmark, you can end up with code that looks
good on the benchmark but not so good on real-world data -- you end up
optimizing for the benchmark rather than for the real-world use case.
Real-world optimization isn't so straightforward as minimizing a single
number; it's often a multi-dimensional problem. :-P
T
--
Amateurs built the Ark; professionals built the Titanic.