How to squeeze more performance out of gcd?

Sat Jun 13 14:25:21 UTC 2020

On Friday, 12 June 2020 at 23:27:35 UTC, H. S. Teoh wrote:
> So this week I finished a major feature in one of my projects, 
> which involves adding a custom number type that can do exact 
> arithmetic with quadratic irrationals (numbers of the form 
> (a+b*√r)/c with integer a,b,c, and fixed r). The custom number 
> type, unsurprisingly, performs about 5x worse than using 
> `double`, but with the plus side that arithmetic is exact so I 
> never have to worry about round-off errors.
>
> Profiling a typical use case reveals that the most prominent 
> bottleneck (50+% of total running time) is std.numeric.gcd.  
> Inspecting the disassembly shows that it's using the fast 
> binary GCD algorithm (Stein's algorithm) with the bsf and cmovg 
> instructions (this is with `ldc2 -mcpu=native -O3`), which is 
> as fast as you can get for GCD, AFAIK.
>
> Is there a way to squeeze more juice out of gcd?  Like is there 
> some novel algorithm or hack that could trim that 50% figure a 
> little?
> [...]
> T

I was doing some competitive programming last month and fast-gcd 
came up, this was the fastest I found back then: 
https://lemire.me/blog/2013/12/26/fastest-way-to-compute-the-greatest-common-divisor/

Of course it turned out the actual fast approach to my problem 
was to skip the gcd calculations entirely, because they were 
irrelevant, but that blog post may help you.