FFT in D (using SIMD) and benchmarks

a a at a.com
Tue Jan 24 17:04:59 PST 2012


On Wednesday, 25 January 2012 at 00:49:15 UTC, bearophile wrote:
> a:
>
>> Because dmd currently doesn't have an intrinsic for the SHUFPS 
>> instruction I've included a version block with some GDC 
>> specific code (this gave me a speedup of up to 80%).
>
> It seems an instruction worth having in dmd too.
>
>
>> Chart: http://cloud.github.com/downloads/jerro/pfft/image.png
>
> I know your code is relatively simple, so it's not meant to be 
> the fastest on the ground, but in your nice graph _as reference 
> point_ I'd like to see a line for the FFTW too. Such line is 
> able to show us how close or how far all this is from an 
> industry standard performance.
> (And if possible I'd like to see two lines for the LDC2 
> compiler too.)
>
> Bye,
> bearophile

The "bench" program in the fftw test directory gives this when 
run in a loop (the first column is log2 of the problem size):


2	Problem: 4, setup: 21.00 us, time: 11.16 ns, ``mflops'': 3583.7
3	Problem: 8, setup: 21.00 us, time: 22.84 ns, ``mflops'': 5254.3
4	Problem: 16, setup: 24.00 us, time: 46.83 ns, ``mflops'': 6833.9
5	Problem: 32, setup: 290.00 us, time: 56.71 ns, ``mflops'': 14108
6	Problem: 64, setup: 1.00 ms, time: 111.47 ns, ``mflops'': 17225
7	Problem: 128, setup: 2.06 ms, time: 227.22 ns, ``mflops'': 19717
8	Problem: 256, setup: 3.99 ms, time: 499.48 ns, ``mflops'': 20501
9	Problem: 512, setup: 7.11 ms, time: 1.10 us, ``mflops'': 20958
10	Problem: 1024, setup: 14.51 ms, time: 2.47 us, ``mflops'': 20690
11	Problem: 2048, setup: 30.18 ms, time: 5.72 us, ``mflops'': 19693
12	Problem: 4096, setup: 61.20 ms, time: 13.20 us, ``mflops'': 18622
13	Problem: 8192, setup: 127.97 ms, time: 36.02 us, ``mflops'': 14784
14	Problem: 16384, setup: 252.58 ms, time: 82.43 us, ``mflops'': 13913
15	Problem: 32768, setup: 490.55 ms, time: 194.14 us, ``mflops'': 12659
16	Problem: 65536, setup: 1.13 s, time: 422.50 us, ``mflops'': 12409
17	Problem: 131072, setup: 2.67 s, time: 994.75 us, ``mflops'': 11200
18	Problem: 262144, setup: 5.77 s, time: 2.28 ms, ``mflops'': 10338
19	Problem: 524288, setup: 1.72 s, time: 9.50 ms, ``mflops'': 5243.4
20	Problem: 1048576, setup: 5.51 s, time: 20.55 ms, ``mflops'': 5102.8
21	Problem: 2097152, setup: 9.55 s, time: 42.88 ms, ``mflops'': 5135.2
22	Problem: 4194304, setup: 26.51 s, time: 88.56 ms, ``mflops'': 5209.8

This was with FFTW compiled for single precision and with SSE, 
but without AVX support. When I compiled FFTW with AVX support, 
the peak was at about 30 GFLOPS, IIRC. It might be even faster 
with a different configuration. The C++ version of my FFT also 
supports AVX and reaches about 24 GFLOPS when using it. If AVX 
types are added to D, I will port that part too.
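
For anyone comparing these numbers with other measurements: as far 
as I know, the ``mflops'' value that bench prints is not a measured 
flop count but a scaled speed, 5 * N * log2(N) divided by the time 
in microseconds (the convention for complex transforms). The rows 
in the table above are consistent with that. Below is a minimal 
Python sketch that recomputes the figure for three of the rows; the 
function name fftw_mflops is made up for this illustration and is 
not part of FFTW or of my code.

```python
import math


def fftw_mflops(n, seconds):
    """Scaled speed in the style of FFTW's bench output.

    Assumes a complex transform of size n, for which the convention is
    5 * n * log2(n) / (time in microseconds). This is a normalized
    speed, not a count of the operations actually executed.
    """
    microseconds = seconds * 1e6
    return 5.0 * n * math.log2(n) / microseconds


# Rows copied from the table above:
# (problem size, reported time in seconds, reported mflops)
rows = [
    (4, 11.16e-9, 3583.7),
    (1024, 2.47e-6, 20690.0),
    (4194304, 88.56e-3, 5209.8),
]

for n, seconds, reported in rows:
    computed = fftw_mflops(n, seconds)
    print(f"N={n:>8}  computed={computed:9.1f}  reported={reported:9.1f}")
```

The computed values differ slightly from the reported ones because 
the times in the table are rounded. The same formula can be applied 
to timings from any other FFT to put them on the same scale.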

