Counting bits in a ulong

H. S. Teoh hsteoh at quickfur.ath.cx
Thu Jun 4 01:16:13 UTC 2020


So recently I needed to implement a function to count the number of 1
bits in a ulong inside a (very) hot inner loop.  There's of course the
naïve method (`val` is the input ulong):

        uint count;
        foreach (i; 0 .. 64)
        {
            if (val & (1L << i))
                count++;
        }
        return count;

Knowing that this code is inside a hotspot, I wrote a parallel bit
counter based on the idea from [1]:

        // Parallel bit count.
        // We count upper/lower halves separately, interspersed to maximize
        // hyperthreading. >:-)
        uint v1 = cast(uint)(val >> 32);
        uint v2 = cast(uint)val & 0xFFFF_FFFF;

        v1 = v1 - ((v1 >> 1) & 0x5555_5555);
        v2 = v2 - ((v2 >> 1) & 0x5555_5555);
        v1 = ((v1 >> 2) & 0x3333_3333) + (v1 & 0x3333_3333);
        v2 = ((v2 >> 2) & 0x3333_3333) + (v2 & 0x3333_3333);
        v1 = ((v1 >> 4) & 0x0F0F_0F0F) + (v1 & 0x0F0F_0F0F);
        v2 = ((v2 >> 4) & 0x0F0F_0F0F) + (v2 & 0x0F0F_0F0F);
        v1 = ((v1 >> 8) & 0x00FF_00FF) + (v1 & 0x00FF_00FF);
        v2 = ((v2 >> 8) & 0x00FF_00FF) + (v2 & 0x00FF_00FF);
        v1 = ((v1 >> 16) & 0x0000_FFFF) + (v1 & 0x0000_FFFF);
        v2 = ((v2 >> 16) & 0x0000_FFFF) + (v2 & 0x0000_FFFF);

        return cast(ulong)v1 + cast(ulong)v2;

[1] https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel

The basic idea behind this trick is to add bits in pairs, then in 4's,
then in 8's, and so on, inside the ulong itself.  I furthermore split it
into upper and lower halves and interleaved the two halves to cut the
data dependency between instructions and (hopefully) increase the
opportunity for hyperthreading / increase out-of-order instruction
execution.  This is completely branchless and in theory should maximize
throughput. Unfortunately it comes at the cost of high instruction
count.

The third alternative is known as Kernighan's trick, which involves a
loop that executes exactly as many times as the number of 1 bits:

        uint count;
        for (count = 0; val; count++)
        {
            val &= val - 1;
        }
        return count;

(How it does this is left as an exercise for the reader. It's really
quite clever.)  This comes with the advantage of a very low instruction
count, but it does have a branch that isn't easily predictable, so
actual execution time is marginally slower.


I did some measurements with real data (not just a micro benchmark, this
is the actual algorithm with the bitcount function in its hotspot), and
as expected the naïve bitcount performed the worst, about 2x slower.
However, between Kernighan and the parallel bit counter, the results
depend on the kind of input given.  For program inputs where there is a
large number of 1's in the average ulong to be counted, the parallel bit
counter performs better. Here's one example of such an input case:

	Naïve:
	real	0m4.159s
	user	0m4.172s
	sys	0m0.024s

	Kernighan:
	real	0m2.114s
	user	0m2.129s
	sys	0m0.020s

	Parallel:
	real	0m2.024s
	user	0m2.039s
	sys	0m0.028s

As you can see, the advantage of the parallel count is only slightly
(albeit consistently) better than Kernighan.

However, when the input tends to generate ulongs with a low saturation
of 1's, Kernighan wins out by quite a large margin:

	Naïve:
	real	0m5.661s
	user	0m5.706s
	sys	0m0.054s

	Kernighan:
	real	0m3.080s
	user	0m3.123s
	sys	0m0.058s

	Parallel:
	real	0m3.805s
	user	0m3.844s
	sys	0m0.056s

This illustrates how optimization is often input-dependent, and what
works well in one case may work not so well in another case. And even an
overall better choice (in this case Kernighan's trick) will in some
niche cases perform less well.

This goes to reinforce the point Andrei recently made, that sometimes if
you optimize for a micro-benchmark, you can end up with code that looks
good on the benchmark but not so good on real-world data -- you end up
optimizing for the benchmark rather than for the real-world use case.
Real-world optimization isn't so straightforward as minimizing a single
number; it's often a multi-dimensional problem. :-P


T

-- 
Amateurs built the Ark; professionals built the Titanic.


More information about the Digitalmars-d mailing list