How to squeeze more performance out of gcd?

Fri Jun 12 23:27:35 UTC 2020

So this week I finished a major feature in one of my projects, which
involves adding a custom number type that can do exact arithmetic with
quadratic irrationals (numbers of the form (a+b*√r)/c with integer
a,b,c, and fixed r). The custom number type, unsurprisingly, performs
about 5x worse than using `double`, but with the plus side that
arithmetic is exact so I never have to worry about round-off errors.

Profiling a typical use case reveals that the most prominent bottleneck
(50+% of total running time) is std.numeric.gcd.  Inspecting the
disassembly shows that it's using the fast binary GCD algorithm (Stein's
algorithm) with the bsf and cmovg instructions (this is with `ldc2
-mcpu=native -O3`), which is as fast as you can get for GCD, AFAIK.

Is there a way to squeeze more juice out of gcd?  Like is there some
novel algorithm or hack that could trim that 50% figure a little?

One of the reasons gcd is a bottleneck is because the quadratic
irrationals are normalized after every operation. I contemplated
suppressing some of these normalizations in order to boost performance,
but just today I had to fix a major bug caused by coefficient overflow,
so I'm not keen on postponing normalization unless there's no other way
around it.

T

-- 
For every argument for something, there is always an equal and opposite argument against it. Debates don't give answers, only wounded or inflated egos.