M1 10x faster than Intel at integral division, throughput one 64-bit divide in two cycles

Witold Baryluk witold.baryluk at gmail.com
Thu May 13 12:06:01 UTC 2021


On Thursday, 13 May 2021 at 11:58:50 UTC, Witold Baryluk wrote:
> On Thursday, 13 May 2021 at 01:59:15 UTC, Andrei Alexandrescu 
> wrote:
>> https://www.reddit.com/r/programming/comments/nawerv/benchmarking_division_and_libdivide_on_apple_m1/
>>
>> Integral division is the slowest arithmetic operation.
>>
>> I have a friend who knows some M1 internals. He said it's 
>> really Star Trek stuff.
>>
>> This will seriously challenge other CPU producers.
>>
>
> No. It means nothing.
>
> 1) The M1 is built on a smaller manufacturing node, allowing 
> Apple to "waste" more silicon area on such niche stuff.
>
> 2) He measured throughput in highly division-dense code. He 
> didn't measure the actual speed (latency) of a divide. Anybody 
> can make integer division faster (higher throughput) by 
> throwing more execution units at it or by fully pipelining the 
> integer divider. That costs a lot of silicon, for zero gain, 
> because real-world code doesn't have a div followed by another 
> div every next or second instruction.
>
> He only checked one x86 CPU. What about 3rd Gen Xeons (e.g. 
> Cooper Lake)? Zen 2? Did he isolate memory effects?
>
> I see he used a very high `count` in the loop: `count=524k`. 
> That is 2MiB / 4MiB of data for u32 / u64 elements. Granted, 
> this access pattern is easy to predict and prefetching will 
> work really well, but it will still touch multiple levels of 
> cache.
>
> There is a data dependency in the loop on `sum`, making it 
> really hard for the CPU to speculate ahead.
>
> We don't see the assembly code, and don't know how much the 
> loops were unrolled / how much potential there is for 
> parallelism or hardware pipelining.
>
> 3) Apple can get away with that because they run on a 
> leading-edge manufacturing node, clock lower to reduce power, 
> and waste silicon.
>
>
> My guess is: if you do a single divide outside of a loop, the 
> latency will most likely be about the same on both platforms.
>
> There is nothing magical about M1 "fast" division.

I just tested, using his benchmark code, on my somewhat older 
AMD Zen+ CPU, clocked at 2.8GHz (so actually slower than either 
the M1 or the tested Xeon):

I got 1.156ns per u32 divide using the hardware divide. 
Normalized to 3.2GHz, that becomes 1.01ns.

0.399ns (or 0.349ns normalized to 3.2GHz) when using `libdivide`. 
So exactly the same speed as the M1 (0.351ns).

So, no, M1 is not 10 times faster than "x86".

Next time, exercise more critical thinking when reading 
"benchmark" claims.


