Counting bits in a ulong

ttk ttk at ciar.org
Fri Jun 5 19:12:43 UTC 2020


On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> So recently I needed to implement a function to count the 
> number of 1 bits in a ulong inside a (very) hot inner loop.

This made me wonder how well a lookup table approach would 
compare to these logical methods (and it was an excuse to blow 
off work a bit to fiddle with D).

The popcnt instruction is going to blow away all of these, of 
course, but it still seems like a worthwhile exercise, if only to 
get a feel for how purely in-register logic performance compares 
to cache access performance on modern "big" CPUs.

ttk at kirov:/home/ttk/prog/D$ ./popcount
Starting tests, which take about 10 seconds each
Naive:         5596740 iter/sec
Parallel:      166352970 iter/sec
Kernighan:     47668800 iter/sec
Lookup 8-bit:  467161540 iter/sec
Lookup 16-bit: 826179570 iter/sec

It surprised me a bit that the 16-bit lookup wasn't closer to 
2.0x the performance of the 8-bit lookup, since both lookup 
tables fit well within the L1 cache.

Source is here .. a bit large, since I manually unrolled loops 
100x to minimize the impact of looping logic (particularly the 
now() call):

http://ciar.org/h/popcount.d

I ran this on an Intel Core i7-9750M @ 2.6GHz, with 192KB L1 / 
1.5MB L2 / 12MB L3 caches, compiled with dmd v2.088.1 via "dmd -O 
popcount.d" on Slackware-current.



More information about the Digitalmars-d mailing list