Performance issue with @fastmath and vectorization
rikki cattermole via digitalmars-d-ldc
digitalmars-d-ldc at puremagic.com
Fri Nov 11 19:30:47 PST 2016
On 12/11/2016 1:03 PM, dextorious wrote:
> As part of slowly learning the basics of programming in D, I ported some
> of my fluid dynamics code from C++ to D and quickly noticed a rather
> severe performance degradation by a factor of 2-3x. I've narrowed it
> down to a simple representative benchmark of virtually identical C++ and
> D code.
>
> The D version: http://pastebin.com/Rs9CUA5j
> The C++ code: http://pastebin.com/XzStHXA2
>
> I compile the D code with the latest LDC beta release from GitHub, using the
> compiler switches -release -O5 -mcpu=haswell -boundscheck=off. The C++
> version is compiled using Clang 3.9.0 with the switches -std=c++14
> -Ofast -fno-exceptions -fno-rtti -flto -ffast-math -march=native, which
> is my usual configuration for numerical code.
>
> On my Haswell i7-4710HQ machine the C++ version runs in ~10ms/iteration
> while the D code takes 25ms. Comparing profiler output with the
> generated assembly code quickly reveals the reason: while Clang fully
> unrolls the inner loop and uses FMA instructions wherever possible, the
> inner loop assembly produced by LDC looks like this:
>
> 0.24 │6c0: vmovss (%r15,%rbp,4),%xmm4
> 1.03 │ vmovss (%r12,%rbp,4),%xmm5
> 3.51 │ add $0x4,%rdi
> 6.96 │ add $0x4,%rax
> 1.04 │6d4: vmulss (%rax,%rcx,1),%xmm4,%xmm4
> 4.66 │ vmulss (%rax,%rdx,1),%xmm5,%xmm5
> 8.44 │ vaddss %xmm4,%xmm5,%xmm4
> 1.09 │ vmulss %xmm0,%xmm4,%xmm5
> 3.73 │ vmulss %xmm4,%xmm5,%xmm4
> 7.48 │ vsubss %xmm3,%xmm4,%xmm4
> 1.13 │ vmulss %xmm1,%xmm4,%xmm4
> 2.00 │ vaddss %xmm2,%xmm5,%xmm5
> 3.46 │ vmovss 0x0(%r13,%rbp,4),%xmm6
> 7.85 │ vmulss (%rax,%rsi,1),%xmm6,%xmm6
> 2.50 │ vaddss %xmm4,%xmm5,%xmm4
> 6.49 │ vmulss %xmm4,%xmm6,%xmm4
> 25.48 │ vmovss %xmm4,(%rdi)
> 8.26 │ cmp $0x20,%rax
> 0.00 │ ↑ jne 6c0
>
> Am I doing something blatantly wrong here or have I run into a compiler
> limitation? Is there anything short of using intrinsics or calling C/C++
> code I can do here to get to performance parity?
>
> Also, while on the subject, is there a way to force LDC to apply the
> relaxed floating point model to the entire program, rather than
> individual functions (the equivalent of --fast-math)?
Just a thought, but try this:

import ldc.attributes : fastmath;

void compute_neq(float[] neq,
                 const float[] ux,
                 const float[] uy,
                 const float[] rho,
                 const float[] ex,
                 const float[] ey,
                 const float[] w,
                 const size_t N) @fastmath {
    foreach (idx; 0 .. N * N) {
        // squared velocity magnitude at this node
        float usqr = ux[idx] * ux[idx] + uy[idx] * uy[idx];
        foreach (q; 0 .. 9) {
            float eu = 3.0f * (ex[q] * ux[idx] + ey[q] * uy[idx]);
            float tmp = 1.0f + eu + 0.5f * eu * eu - 1.5f * usqr;
            tmp *= w[q] * rho[idx];
            neq[idx * 9 + q] = tmp;
        }
    }
}
It may not make any difference since it is semantically the same, but I
thought that at the very least rewriting it to be a bit more idiomatic might help.
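
In case a quick smoke test is useful, here is a rough driver sketch for the
rewritten function. The grid size N = 128, the uniform fill values and the
nine-entry weights are placeholders I made up, not values from the original
benchmark:

// Hypothetical driver: allocates the nine-direction lattice buffers,
// fills them with placeholder values and calls compute_neq above.
void main() {
    import std.stdio : writeln;

    enum size_t N = 128;          // placeholder grid size
    auto ux  = new float[N * N];
    auto uy  = new float[N * N];
    auto rho = new float[N * N];
    auto neq = new float[N * N * 9];

    float[9] ex = 0.0f;           // placeholder lattice vectors
    float[9] ey = 0.0f;
    float[9] w  = 1.0f / 9.0f;    // placeholder weights

    ux[]  = 0.05f;                // placeholder velocity field
    uy[]  = 0.0f;
    rho[] = 1.0f;                 // placeholder density field

    compute_neq(neq, ux, uy, rho, ex[], ey[], w[], N);
    writeln(neq[0 .. 9]);
}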