dclugston at googlemail.com
Mon Aug 2 07:23:39 PDT 2010
On 2 August 2010 15:41, David Simcha <dsimcha at gmail.com> wrote:
> Unfortunately I just downloaded the benchmark program for FFTW and my FFT is
> a ton slower, depending on how you look at it. Using size 2^20 as my
> benchmark, FFTW takes about 131 seconds to create its plan, even using
> -oestimate, the fastest planner. However, the plan can be reused if
> performing multiple FFTs of the same size, and once the plan is created, it
> can do an FFT of size 2^20 in about 53 milliseconds (which I find almost
> unbelievable because even sorting an array of size 2^20 using a
> well-optimized quick sort takes almost that long, and FFT seems like it
> should should have a much larger constant than quick sort), compared to my
> latest improvements to around 730 on the hardware I'm benchmarking on.
> On the other hand, for one-off use cases, the lack of needing to create a
> plan is a big win, both from a speed and a simplicity of API point of view.
> Benchmarking against R, which doesn't appear to use plans, making the
> comparison somewhat more relevant, things look better for my FFT: R takes
> about 610 milliseconds for a size 2^20 pure real FFT.
All you're seeing is the L2 cache. Did you see the link I posted to
the NG about the internals of FFTW? The graph at the top shows FFTW is
40 times faster than the 'numerical recipes' code for 2^^20. So your
factor of 13 isn't so bad. Based on that graph, if you reduce it to
say 2^^15, the factor should drop significantly. Adding a little bit
of cache awareness (using core.cpuid) should be able to avoid the
Also, DMD's floating point optimiser is so primitive, you lose up to a
factor of two immediately.
More information about the phobos