M1 10x faster than Intel at integral division, throughput one 64-bit divide in two cycles

Witold Baryluk witold.baryluk at gmail.com
Thu May 13 11:58:50 UTC 2021


On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu 
wrote:
> https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/
>
> Integral division is the strongest arithmetic operation.
>
> I have a friend who knows some M1 internals. He said it's 
> really Star Trek stuff.
>
> This will seriously challenge other CPU producers.
>

No. It means nothing.

1) M1 is built on a smaller manufacturing node, which lets Apple 
"waste" more silicon area on niche features like this.

2) He measured throughput in highly div-dense code. He didn't 
measure the actual speed (latency) of a divide. Anybody can make 
integer division "faster" (higher throughput) by throwing more 
execution units at it or by fully pipelining the divider. That 
costs a lot of silicon for essentially zero gain, because real-world 
code almost never has one divide following another within the next 
instruction or two.
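To make the latency/throughput distinction concrete, here is a minimal 
sketch in C (not the actual benchmark code, which we haven't seen; 
function names and the `+ constant` trick are made up for illustration). 
The first loop is a dependent chain and runs at one divide per divider 
latency; the second has four independent chains, which is the only kind 
of code where a pipelined or replicated divider shows its headline 
throughput.

    #include <stdint.h>

    /* Latency-bound: each divide waits for the previous result, so the
       loop runs at one divide per divider latency (often tens of cycles
       for 64-bit operands).  The "+ 3" just keeps the value from
       collapsing to a constant. */
    uint64_t div_latency(uint64_t x, uint64_t d, long n) {
        for (long i = 0; i < n; i++)
            x = x / d + 3;
        return x;
    }

    /* Throughput-bound: four independent chains that a pipelined (or
       duplicated) divider can overlap.  Only loops shaped like this can
       approach "one 64-bit divide every two cycles". */
    uint64_t div_throughput(uint64_t a, uint64_t b, uint64_t c,
                            uint64_t e, uint64_t d, long n) {
        for (long i = 0; i < n; i++) {
            a = a / d + 3;
            b = b / d + 5;
            c = c / d + 7;
            e = e / d + 11;
        }
        return a ^ b ^ c ^ e;
    }

Real-world code looks far more like the first loop than the second.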

He only checked one x86 CPU. What about 3rd Gen Xeons (e.g. 
Cooper Lake)? Zen 2? Did he isolate memory effects?

I see he used a very high `count` in the loop: `count=524k`. That 
is 2 MiB of 32-bit or 4 MiB of 64-bit data. Granted, this access 
pattern is easy to predict and hardware prefetching will work really 
well, but the loop will still be touching multiple levels of cache.
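Back-of-the-envelope on that working set (assuming `count` is 524288 
elements of 32-bit or 64-bit integers; the exact element type is a 
guess, since the writeup doesn't show it):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        const size_t count = 524288;  /* 512 Ki elements */
        /* 524288 * 4 B = 2048 KiB = 2 MiB; 524288 * 8 B = 4096 KiB = 4 MiB */
        printf("32-bit data: %zu KiB\n", count * sizeof(uint32_t) / 1024);
        printf("64-bit data: %zu KiB\n", count * sizeof(uint64_t) / 1024);
        /* Typical L1d is 32-128 KiB and L2 is a few MiB, so a 2-4 MiB
           sequential scan spills past L1 and, on many parts, past L2 too,
           even though a streaming pattern prefetches well. */
        return 0;
    }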

There is also a data dependency in the loop on `sum`, making it 
really hard for the hardware to speculate ahead and overlap 
iterations.

We don't see the assembly code, so we don't know how much the loops 
are unrolled or how much potential there is for instruction-level 
parallelism and hardware pipelining (see the sketch below).
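A sketch of what I mean (again, not the actual benchmark source, just 
a plausible shape of it in C): the first function keeps a single 
accumulator, so every add feeds the next iteration; the second is 
unrolled with four partial sums, which shortens the serial chain and 
gives the divider more independent work.

    #include <stdint.h>
    #include <stddef.h>

    /* One accumulator: the divides are independent of each other, but
       the adds form a serial chain through `sum`. */
    uint64_t sum_div(const uint64_t *a, uint64_t d, size_t n) {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += a[i] / d;
        return sum;
    }

    /* Unrolled with four partial sums: shorter dependency chains and
       more room for the hardware to overlap work.  Whether the compiler
       produced something like this is exactly what we can't tell
       without the assembly. */
    uint64_t sum_div_unrolled(const uint64_t *a, uint64_t d, size_t n) {
        uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i]     / d;
            s1 += a[i + 1] / d;
            s2 += a[i + 2] / d;
            s3 += a[i + 3] / d;
        }
        for (; i < n; i++)      /* leftover elements */
            s0 += a[i] / d;
        return s0 + s1 + s2 + s3;
    }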

3) Apple can get away with this because they are on a leading-edge 
manufacturing node, clock lower to reduce power, and can afford to 
spend the silicon.


My guess is: if you do a single, isolated divide outside a loop, the 
latency will most likely be about the same on both platforms.

There is nothing magical about M1 "fast" division.

