FFT in D (using SIMD) and benchmarks

a a at a.com
Wed Jan 25 16:12:48 PST 2012


On Wednesday, 25 January 2012 at 18:36:35 UTC, Manu wrote:
> Can you paste disassembly's of the GDC code and the G++ code?
> I imagine there's something trivial in the scheduler that GDC 
> has missed.

> Like the other day I noticed GDC was unnecessarily generating a 
> stack frame
> for leaf functions, which Iain already fixed.

I've uploaded it here:

https://github.com/downloads/jerro/pfft/disassembly.zip

For g++ there is the disassembly of the entire object file. For 
gdc that included parts of the standard library and was about 
170k lines, so I copied just the functions that took the most 
time. I have also included the percentages of time taken by 
functions for "./bench 22". I noticed that 2% of the time is 
spent in memset. This could be caused by static array 
initialization in fft_passes_stride, so I'll void-initialize the 
static arrays there.

For very small sizes the code compiled with g++ is probably 
faster because it's more aggressively inlined.

> I'd also be interested to try out my experimental std.simd 
> (portable)
> library in the context of your FFT... might give that a shot, I 
> think it'll
> work well.

For that you probably only need to replace the code in sse.d 
(create stdsimd.d or something). The type that you pass as the 
first parameter of the FFT template should define at least the 
following:

- an alias "vec" for the vector type
- an alias "T" for the scalar type
- an enum "vec_size", which is the number of scalars in a vector
- a function "static vec scalar_to_vector(T)", which takes a 
scalar and returns a vector with all elements set to that scalar
- a function "static void bit_reverse_swap_16(T * p0, T * p1, 
T * p2, T * p3, size_t i1, size_t i2)"
- a function "static void bit_reverse_16(T * p0, T * p1, T * p2, 
T * p3, size_t i)"

You can start by using Scalar!T.bit_reverse_16 and 
Scalar!T.bit_reverse_swap_16 from fft_impl.d, but that won't be 
very fast. Those two functions do the following:

bit_reverse_16: This one reads 16 elements - four from p0 + i, 
four from p1 + i, four from p2 + i and four from p3 + i. Then it 
"bit reverses" them - this means that for each pair of indices j 
and k such that k has the reversed bits of j, it swaps the 
element at j and the element at k. I define the index here so 
that the element that was read from address pn + i + m has index 
n * 4 + m. After bit reversing them, it writes the elements back. 
You can assume that (p0 + i), (p1 + i), (p2 + i) and (p3 + i) 
are all aligned to 4 * T.sizeof.
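The semantics described above can be modeled in plain Python (a 
sketch, not the D implementation: lists stand in for the T* 
pointers, and bitrev4, the 4-bit reversal helper, is a name of my 
own invention):

```python
def bitrev4(j):
    # Reverse the 4 bits of an index in [0, 16).
    return int(f"{j:04b}"[::-1], 2)

def bit_reverse_16(p0, p1, p2, p3, i):
    rows = (p0, p1, p2, p3)
    # Gather: the element read from pn[i + m] gets index n * 4 + m.
    elems = [rows[n][i + m] for n in range(4) for m in range(4)]
    # Swap each pair (j, bitrev4(j)); the j < k guard swaps each pair once.
    for j in range(16):
        k = bitrev4(j)
        if j < k:
            elems[j], elems[k] = elems[k], elems[j]
    # Scatter the elements back to where they were read from.
    for n in range(4):
        for m in range(4):
            rows[n][i + m] = elems[n * 4 + m]
```

Since bit reversal is a self-inverse permutation, applying 
bit_reverse_16 twice restores the original data.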

bit_reverse_swap_16: This one reads 16 elements from (p0 + i), 
(p1 + i), (p2 + i) and (p3 + i) and bit reverses them, and also 
reads the elements from (p0 + j), (p1 + j), (p2 + j) and 
(p3 + j) and bit reverses those. Then it writes the elements 
that were read from offsets i to offsets j and vice versa. You 
can assume that (p0 + i), (p1 + i), (p2 + i), (p3 + i), 
(p0 + j), (p1 + j), (p2 + j) and (p3 + j) are all aligned to 
4 * T.sizeof.
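A Python model of those semantics, under the same conventions as 
before (lists for pointers, hypothetical bitrev4 helper, and 
assuming the i and j blocks do not overlap):

```python
def bitrev4(t):
    # Reverse the 4 bits of an index in [0, 16).
    return int(f"{t:04b}"[::-1], 2)

def bit_reverse_swap_16(p0, p1, p2, p3, i, j):
    rows = (p0, p1, p2, p3)
    # Gather both 16-element blocks first (element from pn[off + m]
    # gets index n * 4 + m), then bit reverse each block.
    a = [rows[n][i + m] for n in range(4) for m in range(4)]
    b = [rows[n][j + m] for n in range(4) for m in range(4)]
    a = [a[bitrev4(t)] for t in range(16)]
    b = [b[bitrev4(t)] for t in range(16)]
    # Write the block read from offsets i to offsets j and vice versa.
    for n in range(4):
        for m in range(4):
            rows[n][i + m] = b[n * 4 + m]
            rows[n][j + m] = a[n * 4 + m]
```

Like bit_reverse_16, this operation is its own inverse: bit 
reversing twice and swapping the two blocks twice is the 
identity.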

If you want to speed up the fft a bit for large sizes (greater 
than or equal to 1 << Options.largeLimit, which is roughly those 
that do not fit in the L1 cache), you can also write these 
functions:

- static void interleave(int n)(vec a, vec b, ref vec c, ref vec 
d)
Here n will be a power of two larger than 1 and less than or 
equal to vec_size. The function breaks vector a into n equally 
sized parts, does the same for b, then interleaves those parts 
and writes them to c and d. For example: for vec_size = 4 and 
n = 4 it should write (a0, b0, a1, b1) to c and (a2, b2, a3, b3) 
to d. For vec_size = 4 and n = 2 it should write (a0, a1, b0, 
b1) to c and (a2, a3, b2, b3) to d.

- static void deinterleave(int n)(vec a, vec b, ref vec c, ref 
vec d)
This is the inverse of interleave!n(a, b, c, d).
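A Python model of the interleave / deinterleave semantics (a 
sketch with lists standing in for SIMD vectors; the real versions 
would use shuffle intrinsics):

```python
def interleave(n, a, b):
    # Break a and b each into n equally sized parts, then alternate
    # the parts: a's k-th part followed by b's k-th part.
    vec_size = len(a)
    part = vec_size // n
    flat = []
    for k in range(n):
        flat += a[k * part:(k + 1) * part]
        flat += b[k * part:(k + 1) * part]
    # The first vec_size elements go to c, the rest to d.
    return flat[:vec_size], flat[vec_size:]

def deinterleave(n, c, d):
    # Undo interleave: collect the even-numbered parts into a and
    # the odd-numbered parts into b.
    vec_size = len(c)
    part = vec_size // n
    flat = list(c) + list(d)
    a, b = [], []
    for k in range(0, 2 * n, 2):
        a += flat[k * part:(k + 1) * part]
        b += flat[(k + 1) * part:(k + 2) * part]
    return a, b
```

This reproduces the examples above: for vec_size = 4 and n = 4 
it yields (a0, b0, a1, b1) and (a2, b2, a3, b3), and for n = 2 
it yields (a0, a1, b0, b1) and (a2, a3, b2, b3).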

- static void complex_array_to_real_imag_vec(int n)(float * arr, 
ref vec rr, ref vec ri)
This one reads n (real, imaginary) pairs from arr. It repeats 
the first element of each pair vec_size / n times and writes 
them to rr, and repeats the second element of each pair 
vec_size / n times and writes them to ri. For example: if 
vec_size is 4 and n is 2, then it should write (arr[0], arr[0], 
arr[2], arr[2]) to rr and (arr[1], arr[1], arr[3], arr[3]) to 
ri. You can assume that arr is aligned to 2 * n * T.sizeof.
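A Python model of that description (again a sketch, with lists 
in place of vectors; vec_size is passed explicitly here since 
there is no vector type to infer it from):

```python
def complex_array_to_real_imag_vec(n, arr, vec_size):
    # Read n (real, imaginary) pairs from arr, repeating each real
    # part vec_size / n times into rr and each imaginary part
    # vec_size / n times into ri.
    rep = vec_size // n
    rr, ri = [], []
    for k in range(n):
        rr += [arr[2 * k]] * rep
        ri += [arr[2 * k + 1]] * rep
    return rr, ri
```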

The functions interleave, deinterleave and 
complex_array_to_real_imag_vec are called in the inner loop, so 
if you can't make them very fast, you should just omit them. The 
algorithm will then fall back to a scalar one for the passes 
where these functions would otherwise be used (it checks for the 
presence of interleave!2).



More information about the Digitalmars-d mailing list