[phobos] FFT

David Simcha dsimcha at gmail.com
Mon Aug 2 20:58:38 PDT 2010


I've figured out that a major bottleneck is actually the computation of 
sine and cosine values.  I had realized this to some degree and 
pre-computed the values for FFTs up to size 4096 at compile time.  
However, once the lookup tables grow bigger than the CPU's L2 cache, 
accessing them causes so many cache misses that it's actually slower than 
computing the values on the fly.  On my machine, just computing 2^^20 sine 
and cosine values once takes longer than FFTW takes to produce a full FFT.  
I therefore cannot even begin to imagine what tricks FFTW
uses under the hood to get around this, unless it's doing something very 
clever to store them all in its plan object, and then somehow accessing 
them in an extremely cache-efficient manner.
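
For the curious, the measurement is essentially the following minimal 
sketch (it uses the current std.datetime.stopwatch API, and the sink 
accumulator is only there to keep the optimizer from eliminating the loop):

import std.datetime.stopwatch : AutoStart, StopWatch;
import std.math : PI, cos, sin;
import std.stdio : writefln;

void main()
{
    enum n = 1 << 20;  // 2^^20 values, as in the benchmark above
    double sink = 0;   // prevents dead-code elimination of the loop
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. n)
        sink += sin(2 * PI * i / n) + cos(2 * PI * i / n);
    sw.stop();
    writefln("2^^20 sin/cos pairs: %s ms  (sink = %s)",
             sw.peek.total!"msecs", sink);
}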

It seems like this is a place where cache awareness could sneak into the 
design.  The values could be pre-computed via a static this at runtime and 
stored in an immutable array, and the number of values to compute could be 
decided on the fly based on whatever will fit in the L2 cache.
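
Something like the following sketch, assuming core.cpuid's dataCaches 
property (cache sizes there are reported in kilobytes) and an arbitrary 
budget of half the cache for the table:

import core.cpuid : dataCaches;
import std.math : PI, sin;

// Twiddle-factor table, built once at program startup.
immutable(double)[] sineTable;

shared static this()  // per-process, so the table is shared by all threads
{
    // Take the largest data cache that core.cpuid reports (size is in KB).
    size_t cacheBytes;
    foreach (c; dataCaches)
        if (c.size * 1024 > cacheBytes)
            cacheBytes = c.size * 1024;
    if (cacheBytes == 0)
        cacheBytes = 512 * 1024;  // conservative fallback if detection fails

    // Budget half the cache, rounded down to a power of two entries.
    size_t maxEntries = cacheBytes / 2 / double.sizeof;
    size_t n = 1;
    while (n * 2 <= maxEntries)
        n *= 2;

    auto t = new double[n];
    foreach (i, ref x; t)
        x = sin(2 * PI * i / n);
    sineTable = cast(immutable) t;
}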

I tried this optimization and it gets me down to ~660 milliseconds on my 
crappy 512K L2 cache machine, or ~615 if I use single-precision floats for 
the lookup table (which fits twice as many entries into the same cache 
footprint).  This is comparable to R, which is widely respected numerics 
software but isn't resorting to any black magic.  The only issue I see is 
that, with L2 caches being as huge as they've become lately, people might 
get upset if importing std.fft or std.numerics or whatever silently 
consumes 8 megs of RAM via a static this.  On the other hand, lazy 
initialization raises the standard ugly Singleton threading issue in an 
extremely performance-critical situation.
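
To make that tradeoff concrete, the lazy variant would be roughly the 
classic double-checked pattern below.  This is only a sketch: 'twiddles' 
is a made-up name, the table size is hard-coded where the cache-aware code 
above would choose it, and it uses the current core.atomic names:

import core.atomic : atomicLoad, atomicStore, MemoryOrder;
import std.math : PI, sin;

private shared immutable(double)* tablePtr;  // null until first use
private __gshared size_t tableLen;  // written before tablePtr is published

immutable(double)[] twiddles()
{
    // Fast path: every single call pays this atomic acquire-load.
    auto p = atomicLoad!(MemoryOrder.acq)(tablePtr);
    if (p !is null)
        return p[0 .. tableLen];

    synchronized  // slow path: the first caller builds the table
    {
        p = atomicLoad!(MemoryOrder.acq)(tablePtr);
        if (p is null)
        {
            enum n = 1 << 16;  // where the cache-aware code would pick a size
            auto t = new double[n];
            foreach (i, ref x; t)
                x = sin(2 * PI * i / n);
            tableLen = n;
            p = cast(immutable(double)*) t.ptr;
            // The release-store publishes the fully initialized table.
            atomicStore!(MemoryOrder.rel)(tablePtr, p);
        }
    }
    return p[0 .. tableLen];
}

Even once the table exists, every call pays the acquire-load and the 
branch, and that's in exactly the spot an FFT inner loop can least afford 
it.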

On 8/2/2010 10:23 AM, Don Clugston wrote:
> On 2 August 2010 15:41, David Simcha <dsimcha at gmail.com> wrote:
>    
>> Unfortunately, I just downloaded the benchmark program for FFTW, and my
>> FFT is a ton slower, depending on how you look at it.  Using size 2^^20 as
>> my benchmark, FFTW takes about 131 seconds to create its plan, even using
>> -oestimate, the fastest planner.  However, the plan can be reused when
>> performing multiple FFTs of the same size, and once the plan is created, it
>> can do an FFT of size 2^^20 in about 53 milliseconds (which I find almost
>> unbelievable, because even sorting an array of size 2^^20 with a
>> well-optimized quicksort takes almost that long, and FFT should have a much
>> larger constant factor than quicksort), compared to around 730 milliseconds
>> for my latest improvements on the hardware I'm benchmarking on.
>>
>> On the other hand, for one-off use cases, not needing to create a plan is
>> a big win, both for speed and for simplicity of the API.  Benchmarking
>> against R, which doesn't appear to use plans (making the comparison
>> somewhat more relevant), things look better for my FFT:  R takes about 610
>> milliseconds for a size 2^^20 pure real FFT.
>>      
> All you're seeing is the L2 cache. Did you see the link I posted to
> the NG about the internals of FFTW? The graph at the top shows FFTW is
> 40 times faster than the 'Numerical Recipes' code for 2^^20. So your
> factor of 13 isn't so bad. Based on that graph, if you reduce it to
> say 2^^15, the factor should drop significantly. Adding a little bit
> of cache awareness (using core.cpuid) should be able to avoid the
> performance cliff.
> Also, DMD's floating point optimiser is so primitive, you lose up to a
> factor of two immediately.


