Rather Bizarre slow downs using Complex!float with avx (ldc).
james.p.leblanc
james.p.leblanc at gmail.com
Fri Oct 1 08:32:14 UTC 2021
On Thursday, 30 September 2021 at 16:52:57 UTC, Johan wrote:
> On Thursday, 30 September 2021 at 16:40:03 UTC, james.p.leblanc
> Generally, for performance issues like this you need to study
> assembly output (`--output-s`) or LLVM IR (`--output-ll`).
> First thing I would look out for is function inlining yes/no.
>
> cheers,
> Johan
Johan,
Thanks kindly for your reply. As suggested, I have looked at the
assembly output.
Strangely, the fused multiply-add instructions are indeed present in
the AVX version, but the example
still runs slower for the **Complex!float** data type.
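On the inlining question raised above, one way to rule it in or out from the source side is to force inlining with `pragma(inline, true)` and compare timings. The following is only a sketch: the name `dotprodInlined` and the small test values are made up for illustration, and whether this changes the AVX timing at all is untested.

```d
import std.complex;
import std.stdio;

alias T = Complex!float;

// Ask the compiler to always inline this function at its call sites.
pragma(inline, true)
T dotprodInlined(T[] x, T[] y)
{
    auto sum = T(0, 0);
    foreach (i; 0 .. x.length)
        sum += x[i] * conj(y[i]); // same kernel as the benchmark
    return sum;
}

void main()
{
    // Small, exactly representable values (hypothetical test data).
    auto x = new T[](4);
    auto y = new T[](4);
    x[] = T(1, 2);
    y[] = T(3, -1);

    auto z = dotprodInlined(x, y);
    writefln("%s %s", z.re, z.im);
}
```

If the timings with and without the pragma are the same, inlining is probably not the cause of the slowdown.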
I have stripped the code down to a minimum, which demonstrates
the weird result:
```d
import ldc.attributes; // with or without this line makes no difference
import std.stdio;
import std.datetime.stopwatch;
import std.complex;

alias T = Complex!float;
auto typestr = "COMPLEX FLOAT";
/* alias T = Complex!double; */
/* auto typestr = "COMPLEX DOUBLE"; */

auto alpha = cast(T) complex(0.1, -0.2); // dummy values to fill arrays
auto beta = cast(T) complex(-0.7, 0.6);

// Conjugated dot product: sum of x[i] * conj(y[i])
auto dotprod( T[] x, T[] y)
{
    auto sum = cast(T) 0;
    foreach( size_t i ; 0 .. x.length)
        sum += x[i] * conj(y[i]);
    return sum;
}

void main()
{
    int nEle = 1000;
    int nIter = 2000;

    auto startTime = MonoTime.currTime;
    auto dur = cast(double) (MonoTime.currTime-startTime).total!"usecs";

    T[] x, y;
    x.length = nEle;
    y.length = nEle;
    T z;
    x[] = alpha;
    y[] = beta;

    // Timed region: nIter * nIter calls to dotprod
    startTime = MonoTime.currTime;
    foreach( i ; 0 .. nIter){
        foreach( j ; 0 .. nIter){
            z = dotprod(x,y);
        }
    }
    auto etime = cast(double) (MonoTime.currTime-startTime).total!"msecs" / 1.0e3;

    writef(" result: % 5.2f%+5.2fi comp time: %5.2f \n", z.re, z.im, etime);
}
```
For convenience, I include the bash script used to compile, run,
generate the assembly code, and grep it:
```bash
echo
echo "With AVX:"
ldc2 -O3 -release question.d --ffast-math -mcpu=haswell
./question
ldc2 -output-s -O3 -release question.d --ffast-math -mcpu=haswell
mv question.s question_with_avx.s
echo
echo "Without AVX"
ldc2 -O3 -release question.d
./question
ldc2 -output-s -O3 -release question.d
mv question.s question_without_avx.s
echo
echo "fused multiply adds are found in avx code (as desired)"
grep vfmadd *.s /dev/null
```
Here is the output when run on my machine:
```console
With AVX:
result: -190.00+80.00i comp time: 6.45
Without AVX
result: -190.00+80.00i comp time: 5.74
fused multiply adds are found in avx code (as desired)
question_with_avx.s: vfmadd231ss %xmm2, %xmm5, %xmm3
question_with_avx.s: vfmadd231ss %xmm0, %xmm2, %xmm3
question_with_avx.s: vfmadd231ss %xmm2, %xmm4, %xmm1
question_with_avx.s: vfmadd231ss %xmm3, %xmm5, %xmm1
question_with_avx.s: vfmadd231ss %xmm3, %xmm1, %xmm0
```
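As a sanity check on the reported result: every term in the sum is `alpha * conj(beta)`, so with 1000 identical elements the dot product should be 1000 times that single product. A minimal sketch of that arithmetic follows. The helper name `expectedDot` is made up for this check, and it works in double precision rather than the float type used in the benchmark.

```d
import std.complex;
import std.stdio;

// Expected dot product when x is filled with a and y with b:
// n * (a * conj(b)).
Complex!double expectedDot(Complex!double a, Complex!double b, size_t n)
{
    return a * conj(b) * cast(double) n;
}

void main()
{
    auto a = complex(0.1, -0.2);
    auto b = complex(-0.7, 0.6);

    // a * conj(b) = (0.1 - 0.2i) * (-0.7 - 0.6i) = -0.19 + 0.08i
    auto e = expectedDot(a, b, 1000);
    writefln("%.2f%+.2fi", e.re, e.im);
}
```

This agrees with the `-190.00+80.00i` printed by both builds, so the two versions differ in speed, not in the result.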
Repeating the experiment after changing the data type to
Complex!double
shows the AVX code to be twice as fast (perhaps more in line with
expectations).
**I admit my confusion as to why the Complex!float is
misbehaving.**
Does anyone have insight into what is happening?
Thanks,
James