[phobos] FFT

Tue Aug 3 21:01:49 PDT 2010

Ok, I've got some more optimizations.  I figured out that it's worth 
trading off some space to make the sin/cos lookup tables be in 
contiguous memory blocks for all sub-FFTs.  I'm now down to ~340 ms on 
my benchmark, but fresh out of optimization ideas.  Here are the 
performance characteristics of the code in case anyone has any clever ideas:

1.  Skipping the slowFourier2 base case and simply returning (Yes, this 
produces incorrect results, but it's just to see the cost of it) shaves 
off a surprising ~110 milliseconds or roughly 30%.  This is probably 
because slowFourier2 always accesses two memory positions that are 
virtually guaranteed to be cache misses.

2.  Switching from doubles to floats for both input and output gains ~60 
milliseconds, probably due to better cache performance.

3.  Skipping the entire twiddle factor loop by sticking a return 
statement before it saves about 190 milliseconds, or only a little over 
half of the total execution time.  This is surprising because a naive 
reader of the code would probably think that almost all execution time 
is being spent in this loop.

4.  I didn't bother (at least, not yet) with a breadth first base case 
because recursion overhead seems fairly negligible.  The code with the 
return statement before the twiddle factor loop and with the 
slowFourier2 call commented out executes in about 40 ms; this is only 
~12%, the recursive implementation is more cache efficient up to a 
point, and the breadth-first impl wouldn't be entirely overhead-free.

I've attached the latest code for review.  If everyone's happy with it 
and noone has any more clever optimization ideas, I'd like to include it 
in the next release.  If I do, I guess std.numerics would be the place 
for it?

On 8/2/2010 10:23 AM, Don Clugston wrote:
> On 2 August 2010 15:41, David Simcha<dsimcha at gmail.com>  wrote:
>    
>> Unfortunately I just downloaded the benchmark program for FFTW and my FFT is
>> a ton slower, depending on how you look at it.  Using size 2^20 as my
>> benchmark, FFTW takes about 131 seconds to create its plan, even using
>> -oestimate, the fastest planner.  However, the plan can be reused if
>> performing multiple FFTs of the same size, and once the plan is created, it
>> can do an FFT of size 2^20 in about 53 milliseconds (which I find almost
>> unbelievable because even sorting an array of size 2^20 using a
>> well-optimized quick sort takes almost that long, and FFT seems like it
>> should should have a much larger constant than quick sort), compared to my
>> latest improvements to around 730 on the hardware I'm benchmarking on.
>>
>> On the other hand, for one-off use cases, the lack of needing to create a
>> plan is a big win, both from a speed and a simplicity of API point of view.
>>   Benchmarking against R, which doesn't appear to use plans, making the
>> comparison somewhat more relevant, things look better for my FFT:  R takes
>> about 610 milliseconds for a size 2^20 pure real FFT.
>>      
> All you're seeing is the L2 cache. Did you see the link I posted to
> the NG about the internals of FFTW? The graph at the top shows FFTW is
> 40 times faster than the 'numerical recipes' code for 2^^20. So your
> factor of 13 isn't so bad. Based on that graph, if you reduce it to
> say 2^^15, the factor should drop significantly. Adding a little bit
> of cache awareness (using core.cpuid) should be able to avoid the
> performance cliff.
> Also, DMD's floating point optimiser is so primitive, you lose up to a
> factor of two immediately.
> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos
>
>    

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: fft3.d
URL: <http://lists.puremagic.com/pipermail/phobos/attachments/20100804/20487348/attachment.ksh>