Counting bits in a ulong

Fri Jun 5 19:45:36 UTC 2020

On Fri, Jun 05, 2020 at 07:12:43PM +0000, ttk via Digitalmars-d wrote:
> On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> > So recently I needed to implement a function to count the number of
> > 1 bits in a ulong inside a (very) hot inner loop.
> 
> This made me wonder how well a lookup table approach would compare to
> these logical methods (and it was an excuse to blow off work a bit to
> fiddle with D).

Hooray for having an excuse to play with D! ;-)

> The popcnt instruction is going to blow away all of these, of course,
> but it still seems like a worthwhile exercise, if only to get a feel
> for how purely in-register logic performance compares to cache access
> performance on modern "big" CPUs.
> 
> ttk at kirov:/home/ttk/prog/D$ ./popcount
> Starting tests, which take about 10 seconds each
> Naive:         5596740 iter/sec
> Parallel:      166352970 iter/sec
> Kernighan:     47668800 iter/sec
> Lookup 8-bit:  467161540 iter/sec
> Lookup 16-bit: 826179570 iter/sec
> 
> It surprised me a bit that the 16-bit lookup wasn't closer to 2.0x the
> performance of the 8-bit lookup, since both lookup tables fit well
> within the L1 cache.

You might be running into alignment issues, possibly. For one thing, why
does LOOKUP_1 have 257 elements instead of 256? That last byte is never
accessed. Similarly, LOOKUP_2 doesn't need 65537 elements, that last
element is redundant.

> Source is here .. a bit large, since I manually unrolled loops 100x to
> minimize the impact of looping logic (particularly the now() call):
> 
> http://ciar.org/h/popcount.d

There's no need to manually unroll the code: just use static
foreach, since the code is identical every iteration! :-P

> I ran this on an Intel Core i7-9750M @ 2.6GHz, with 192KB L1 / 1.5MB
> L2 / 12MB L3 caches, compiled with dmd v2.088.1 via "dmd -O
> popcount.d" on Slackware-current.

I don't trust performance results with dmd, so I tested it on `ldc2 -O3`
instead:

	Starting tests, which take about 10 seconds each
	Naive:         25665230 iter/sec
	Parallel:      124692900 iter/sec
	Kernighan:     40680010 iter/sec
	Lookup 8-bit:  1532505390 iter/sec
	Lookup 16-bit: 1690195340 iter/sec

Something is up with the 16-bit lookups. Why is it only barely faster
than 8-bit lookup?

Disassembling shows that the code was *not* duplicated 100 times;
apparently LDC's aggressive optimizer noticed that `count` is reassigned
every unrolled iteration, so all except the last are no-ops. You need to
do something with `count` that isn't immediately killed by the next
unrolled iteration, otherwise your painstakingly copy-n-pasted
iterations will all be elided by the optimizer. :-P

OR'ing `count` into ACCUMULATOR after every count is probably what you
need to do here.

T

-- 
"640K ought to be enough" -- Bill G. (allegedly), 1984. "The Internet is not a primary goal for PC usage" -- Bill G., 1995. "Linux has no impact on Microsoft's strategy" -- Bill G., 1999.