Counting bits in a ulong
H. S. Teoh
hsteoh at quickfur.ath.cx
Fri Jun 5 19:45:36 UTC 2020
On Fri, Jun 05, 2020 at 07:12:43PM +0000, ttk via Digitalmars-d wrote:
> On Thursday, 4 June 2020 at 01:16:13 UTC, H. S. Teoh wrote:
> > So recently I needed to implement a function to count the number of
> > 1 bits in a ulong inside a (very) hot inner loop.
>
> This made me wonder how well a lookup table approach would compare to
> these logical methods (and it was an excuse to blow off work a bit to
> fiddle with D).
Hooray for having an excuse to play with D! ;-)
> The popcnt instruction is going to blow away all of these, of course,
> but it still seems like a worthwhile exercise, if only to get a feel
> for how purely in-register logic performance compares to cache access
> performance on modern "big" CPUs.
>
> ttk at kirov:/home/ttk/prog/D$ ./popcount
> Starting tests, which take about 10 seconds each
> Naive: 5596740 iter/sec
> Parallel: 166352970 iter/sec
> Kernighan: 47668800 iter/sec
> Lookup 8-bit: 467161540 iter/sec
> Lookup 16-bit: 826179570 iter/sec
>
> It surprised me a bit that the 16-bit lookup wasn't closer to 2.0x the
> performance of the 8-bit lookup, since both lookup tables fit well
> within the L1 cache.
You might be running into alignment issues, possibly. For one thing, why
does LOOKUP_1 have 257 elements instead of 256? That last byte is never
accessed. Similarly, LOOKUP_2 doesn't need 65537 elements, that last
element is redundant.
> Source is here .. a bit large, since I manually unrolled loops 100x to
> minimize the impact of looping logic (particularly the now() call):
>
> http://ciar.org/h/popcount.d
There's no need to manually unroll the code: just use static
foreach, since the code is identical every iteration! :-P
> I ran this on an Intel Core i7-9750M @ 2.6GHz, with 192KB L1 / 1.5MB
> L2 / 12MB L3 caches, compiled with dmd v2.088.1 via "dmd -O
> popcount.d" on Slackware-current.
I don't trust performance results with dmd, so I tested it on `ldc2 -O3`
instead:
Starting tests, which take about 10 seconds each
Naive: 25665230 iter/sec
Parallel: 124692900 iter/sec
Kernighan: 40680010 iter/sec
Lookup 8-bit: 1532505390 iter/sec
Lookup 16-bit: 1690195340 iter/sec
Something is up with the 16-bit lookups. Why is it only barely faster
than 8-bit lookup?
Disassembling shows that the code was *not* duplicated 100 times;
apparently LDC's aggressive optimizer noticed that `count` is reassigned
every unrolled iteration, so all except the last are no-ops. You need to
do something with `count` that isn't immediately killed by the next
unrolled iteration, otherwise your painstakingly copy-n-pasted
iterations will all be elided by the optimizer. :-P
OR'ing `count` into ACCUMULATOR after every count is probably what you
need to do here.
T
--
"640K ought to be enough" -- Bill G. (allegedly), 1984. "The Internet is not a primary goal for PC usage" -- Bill G., 1995. "Linux has no impact on Microsoft's strategy" -- Bill G., 1999.
More information about the Digitalmars-d
mailing list