Counting bits in a ulong

Thu Jun 4 09:30:00 UTC 2020

On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> I did some measurements with real data (not just a micro 
> benchmark, this is the actual algorithm with the bitcount 
> function in its hotspot), and as expected the naïve bitcount 
> performed the worst, about 2x slower. However, between 
> Kernighan and the parallel bit counter, the results depend on 
> the kind of input given.  For program inputs where there is a 
> large number of 1's in the average ulong to be counted, the 
> parallel bit counter performs better. Here's one example of 
> such an input case:
>
> 	Naïve:
> 	real	0m4.159s
> 	user	0m4.172s
> 	sys	0m0.024s
>
> 	Kernighan:
> 	real	0m2.114s
> 	user	0m2.129s
> 	sys	0m0.020s
>
> 	Parallel:
> 	real	0m2.024s
> 	user	0m2.039s
> 	sys	0m0.028s
>
> As you can see, the advantage of the parallel count is only 
> slightly (albeit consistently) better than Kernighan.

Did you run the benchmark on different CPU architectures ?

I recently benchmarked code which was performing a calculation 
one of 2 ways.
1st way was to calculate for real, the other used look up tables.
I ran the test on an i7 haswell and an i5 skylake.
The i5 was performing faster with look up tables, the i7 
performance was worse using look up tables.
The performance impact was similar to your 1st benchmark Parallel 
vs Kernighan.