FFT in D (using SIMD) and benchmarks
a
a at a.com
Tue Jan 24 17:04:59 PST 2012
On Wednesday, 25 January 2012 at 00:49:15 UTC, bearophile wrote:
> a:
>
>> Because dmd currently doesn't have an intrinsic for the SHUFPS
>> instruction I've included a version block with some GDC
>> specific code (this gave me a speedup of up to 80%).
>
> It seems an instruction worth having in dmd too.
>
>
>> Chart: http://cloud.github.com/downloads/jerro/pfft/image.png
>
> I know your code is relatively simple, so it's not meant to be
> the fastest on the ground, but in your nice graph, _as a reference
> point_, I'd like to see a line for FFTW too. Such a line would
> show us how close to or how far from industry-standard
> performance all this is.
> (And if possible I'd like to see two lines for the LDC2
> compiler too.)
>
> Bye,
> bearophile
"bench" program in the fftw test directory gives this when run in
a loop:
2 Problem: 4, setup: 21.00 us, time: 11.16 ns, ``mflops'': 3583.7
3 Problem: 8, setup: 21.00 us, time: 22.84 ns, ``mflops'': 5254.3
4 Problem: 16, setup: 24.00 us, time: 46.83 ns, ``mflops'': 6833.9
5 Problem: 32, setup: 290.00 us, time: 56.71 ns, ``mflops'': 14108
6 Problem: 64, setup: 1.00 ms, time: 111.47 ns, ``mflops'': 17225
7 Problem: 128, setup: 2.06 ms, time: 227.22 ns, ``mflops'': 19717
8 Problem: 256, setup: 3.99 ms, time: 499.48 ns, ``mflops'': 20501
9 Problem: 512, setup: 7.11 ms, time: 1.10 us, ``mflops'': 20958
10 Problem: 1024, setup: 14.51 ms, time: 2.47 us, ``mflops'': 20690
11 Problem: 2048, setup: 30.18 ms, time: 5.72 us, ``mflops'': 19693
12 Problem: 4096, setup: 61.20 ms, time: 13.20 us, ``mflops'': 18622
13 Problem: 8192, setup: 127.97 ms, time: 36.02 us, ``mflops'': 14784
14 Problem: 16384, setup: 252.58 ms, time: 82.43 us, ``mflops'': 13913
15 Problem: 32768, setup: 490.55 ms, time: 194.14 us, ``mflops'': 12659
16 Problem: 65536, setup: 1.13 s, time: 422.50 us, ``mflops'': 12409
17 Problem: 131072, setup: 2.67 s, time: 994.75 us, ``mflops'': 11200
18 Problem: 262144, setup: 5.77 s, time: 2.28 ms, ``mflops'': 10338
19 Problem: 524288, setup: 1.72 s, time: 9.50 ms, ``mflops'': 5243.4
20 Problem: 1048576, setup: 5.51 s, time: 20.55 ms, ``mflops'': 5102.8
21 Problem: 2097152, setup: 9.55 s, time: 42.88 ms, ``mflops'': 5135.2
22 Problem: 4194304, setup: 26.51 s, time: 88.56 ms, ``mflops'': 5209.8
This was with FFTW compiled for single precision and with SSE,
but without AVX support. When I compiled FFTW with AVX support,
the peak was at about 30 GFLOPS, IIRC. It is possible that it
would be even faster if I configured it differently. The
C++ version of my FFT also supports AVX and gets to about 24
GFLOPS when using it. If AVX types are added to D, I will
port that part too.
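
For anyone who wants to compare their own timings with the ``mflops'' column above, here is a rough sketch of how that figure can be reproduced. It assumes the usual FFTW benchmark convention for complex transforms, 5 * N * log2(N) divided by the time in microseconds. This is a nominal operation count, not a measured one. The function name and the sample list are just illustrative, and the times are copied from the output above.

```python
import math

def fftw_mflops(n: int, time_us: float) -> float:
    """Nominal 'mflops' for a complex FFT of size n taking time_us microseconds.

    Assumes the 5 * N * log2(N) operation-count convention; this is a
    scaled speed figure, not a count of the flops actually executed.
    """
    return 5.0 * n * math.log2(n) / time_us

# (problem size, time in microseconds) taken from the bench output above
samples = [(4, 0.01116), (1024, 2.47), (4194304, 88560.0)]

for n, t in samples:
    print(f"N = {n:8d}  time = {t:12.5f} us  mflops ~ {fftw_mflops(n, t):.1f}")
```

The results should agree with the reported column up to the rounding of the printed times. Dividing by 1000 gives the GFLOPS figures quoted in the text.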