M1 10x faster than Intel at integral division, throughput one 64-bit divide in two cycles
Witold Baryluk
witold.baryluk at gmail.com
Thu May 13 11:58:50 UTC 2021
On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu
wrote:
> https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/
>
> Integral division is the strongest arithmetic operation.
>
> I have a friend who knows some M1 internals. He said it's
> really Star Trek stuff.
>
> This will seriously challenge other CPU producers.
>
No. It means nothing.
1) The M1 is built on a smaller manufacturing node, allowing Apple to "waste" more silicon area on niche features like this.
2) He measured throughput in highly division-dense code. He didn't measure the actual speed (latency) of a single divide. Anybody can make integer division faster (higher throughput) by adding more execution units or fully pipelining the divider. That costs a lot of silicon for essentially zero gain, because real-world code almost never has one div following another within the next instruction or two.
He only checked one x86 CPU. What about 3rd Gen Xeons (i.e. Cooper Lake)? Zen 2? Did he isolate memory effects?
I see he used a very high `count` in the loop: `count=524k`. That is 2 MiB / 4 MiB of data. Granted, this access pattern is easy to predict and prefetching will work really well, but it will still touch multiple levels of cache.
There are data dependencies in the loop on `sum`, making it really hard to speculate.
We don't see the assembly code, so we don't know how much the loops were unrolled, or how much potential there is for parallelism or hardware pipelining.
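One thing the missing assembly would reveal is whether the `sum` dependency was broken up. A common manual-unrolling trick (sketched here with hypothetical names, not taken from the benchmark) is to split the accumulator so the hardware can keep more divides in flight:

```c
#include <stdint.h>
#include <stddef.h>

/* Four independent accumulators create four independent dependency
   chains, so up to four divides can overlap even though each chain
   is still serialized through its own cheap add. */
uint64_t sum_div_unrolled4(const uint32_t *a, uint32_t d, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     / d;
        s1 += a[i + 1] / d;
        s2 += a[i + 2] / d;
        s3 += a[i + 3] / d;
    }
    for (; i < n; i++)          /* handle the remainder */
        s0 += a[i] / d;
    return s0 + s1 + s2 + s3;
}
```

Whether the compiler did this transformation (or the CPU's out-of-order engine achieved the same effect by renaming) changes what the benchmark is actually measuring.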
3) Apple can get away with this because they run on a leading-edge manufacturing node, clock lower to reduce power, and can afford to spend the silicon.
My guess is: if you do a single divide outside a loop, it will most likely perform about the same on both platforms.
There is nothing magical about the M1's "fast" division.
More information about the Digitalmars-d mailing list