Counting bits in a ulong
ttk
ttk at ciar.org
Fri Jun 5 19:12:43 UTC 2020
On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> So recently I needed to implement a function to count the
> number of 1 bits in a ulong inside a (very) hot inner loop.
This made me wonder how well a lookup table approach would
compare to these logical methods (and it was an excuse to blow
off work a bit to fiddle with D).
The popcnt instruction is going to blow away all of these, of
course, but it still seems like a worthwhile exercise, if only to
get a feel for how purely in-register logic performance compares
to cache access performance on modern "big" CPUs.
ttk at kirov:/home/ttk/prog/D$ ./popcount
Starting tests, which take about 10 seconds each
Naive: 5596740 iter/sec
Parallel: 166352970 iter/sec
Kernighan: 47668800 iter/sec
Lookup 8-bit: 467161540 iter/sec
Lookup 16-bit: 826179570 iter/sec
It surprised me a bit that the 16-bit lookup wasn't closer to
2.0x the performance of the 8-bit lookup, since both lookup
tables fit well within the L1 cache.
Source is here .. a bit large, since I manually unrolled loops
100x to minimize the impact of looping logic (particularly the
now() call):
http://ciar.org/h/popcount.d
I ran this on an Intel Core i7-9750M @ 2.6GHz, with 192KB L1 /
1.5MB L2 / 12MB L3 caches, compiled with dmd v2.088.1 via "dmd -O
popcount.d" on Slackware-current.
More information about the Digitalmars-d
mailing list