M1 10x faster than Intel at integral division, throughput one 64-bit divide in two cycles

Witold Baryluk witold.baryluk at gmail.com
Thu May 13 12:06:01 UTC 2021


On Thursday, 13 May 2021 at 11:58:50 UTC, Witold Baryluk wrote:
> On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu 
> wrote:
>> https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/
>>
>> Integral division is the slowest arithmetic operation.
>>
>> I have a friend who knows some M1 internals. He said it's 
>> really Star Trek stuff.
>>
>> This will seriously challenge other CPU producers.
>>
>
> No. It means nothing.
>
> 1) The M1 is built on a smaller manufacturing node, allowing 
> Apple to "waste" more silicon area on such niche stuff.
>
> 2) He measured throughput in highly division-dense code. He 
> didn't measure the actual speed (latency) of a divide. Anybody 
> can make integer division faster (higher throughput) by 
> throwing more execution units at it or by fully pipelining the 
> integer divider. That costs a lot of silicon, for zero gain, 
> because real-world code doesn't have a div followed by another 
> div every next or second instruction.
>
> He only checked one x86 CPU. What about 3rd Gen Xeons (e.g. 
> Cooper Lake)? Zen 2? Did he isolate memory effects?
>
> I see he used a very high `count` in the loop: `count=524k`. 
> That is 2MiB / 4MiB of data for u32 / u64 elements. Granted, 
> this access pattern is easy to predict and prefetching will 
> work really well, but it will still touch multiple levels of 
> cache.
>
> There is a data dependency in the loop on `sum`, making it 
> really hard for the CPU to speculate ahead.
>
> We don't see the assembly code, and don't know how much the 
> loops were unrolled / how much potential there is for 
> parallelism or hardware pipelining.
>
> 3) Apple can get away with that because they run on a 
> leading-edge manufacturing node, clock lower to reduce power, 
> and waste silicon.
>
>
> My guess is: if you do a single divide outside of a loop, the 
> latency will most likely be about the same on both platforms.
>
> There is nothing magical about M1 "fast" division.

I just tested, using his benchmark code, on my somewhat older 
AMD Zen+ CPU, clocked at 2.8GHz (so actually slower than either 
the M1 or the tested Xeon):

I got 1.156ns per u32 divide using the hardware divide. 
Normalized to 3.2GHz, that becomes 1.01ns.

0.399ns (or 0.349ns normalized to 3.2GHz) when using `libdivide`. 
So exactly the same speed as the M1 (0.351ns).

So, no, M1 is not 10 times faster than "x86".

Next time, exercise more critical thinking when reading 
"benchmark" claims.


