Performance issue with @fastmath and vectorization

rikki cattermole via digitalmars-d-ldc digitalmars-d-ldc at puremagic.com
Fri Nov 11 19:30:47 PST 2016


On 12/11/2016 1:03 PM, dextorious wrote:
> As part of slowly learning the basics of programming in D, I ported some
> of my fluid dynamics code from C++ to D and quickly noticed a rather
> severe performance degradation by a factor of 2-3x. I've narrowed it
> down to a simple representative benchmark of virtually identical C++ and
> D code.
>
> The D version: http://pastebin.com/Rs9CUA5j
> The C++ code:  http://pastebin.com/XzStHXA2
>
> I compile the D code using the latest beta release on GitHub, using the
> compiler switches -release -O5 -mcpu=haswell -boundscheck=off. The C++
> version is compiled using Clang 3.9.0 with the switches -std=c++14
> -Ofast -fno-exceptions -fno-rtti -flto -ffast-math -march=native, which
> is my usual configuration for numerical code.
>
> On my Haswell i7-4710HQ machine the C++ version runs in ~10ms/iteration
> while the D code takes 25ms. Comparing profiler output with the
> generated assembly code quickly reveals the reason - while Clang fully
> unrolls the inner loop and uses FMA instructions wherever possible, the
> inner loop assembly produced by LDC looks like this:
>
>   0.24 │6c0:   vmovss (%r15,%rbp,4),%xmm4
>   1.03 │       vmovss (%r12,%rbp,4),%xmm5
>   3.51 │       add    $0x4,%rdi
>   6.96 │       add    $0x4,%rax
>   1.04 │6d4:   vmulss (%rax,%rcx,1),%xmm4,%xmm4
>   4.66 │       vmulss (%rax,%rdx,1),%xmm5,%xmm5
>   8.44 │       vaddss %xmm4,%xmm5,%xmm4
>   1.09 │       vmulss %xmm0,%xmm4,%xmm5
>   3.73 │       vmulss %xmm4,%xmm5,%xmm4
>   7.48 │       vsubss %xmm3,%xmm4,%xmm4
>   1.13 │       vmulss %xmm1,%xmm4,%xmm4
>   2.00 │       vaddss %xmm2,%xmm5,%xmm5
>   3.46 │       vmovss 0x0(%r13,%rbp,4),%xmm6
>   7.85 │       vmulss (%rax,%rsi,1),%xmm6,%xmm6
>   2.50 │       vaddss %xmm4,%xmm5,%xmm4
>   6.49 │       vmulss %xmm4,%xmm6,%xmm4
>  25.48 │       vmovss %xmm4,(%rdi)
>   8.26 │       cmp    $0x20,%rax
>   0.00 │     ↑ jne    6c0
>
> Am I doing something blatantly wrong here or have I run into a compiler
> limitation? Is there anything short of using intrinsics or calling C/C++
> code I can do here to get to performance parity?
>
> Also, while on the subject, is there a way to force LDC to apply the
> relaxed floating point model to the entire program, rather than
> individual functions (the equivalent of --fast-math)?

Just a thought, but try this:

import ldc.attributes : fastmath;

void compute_neq(float[] neq,
                 const float[] ux,
                 const float[] uy,
                 const float[] rho,
                 const float[] ex,
                 const float[] ey,
                 const float[] w,
                 const size_t N) @fastmath {
    foreach (idx; 0 .. N * N) {
        // Squared velocity magnitude at this grid point.
        float usqr = ux[idx] * ux[idx] + uy[idx] * uy[idx];

        // Nine-direction inner loop; this is the one that should vectorise.
        foreach (q; 0 .. 9) {
            float eu = 3.0f * (ex[q] * ux[idx] + ey[q] * uy[idx]);
            float tmp = 1.0f + eu + 0.5f * eu * eu - 1.5f * usqr;
            tmp *= w[q] * rho[idx];
            neq[idx * 9 + q] = tmp;
        }
    }
}

It may not make any difference since it is semantically the same, but I 
thought that at the very least rewriting it to be a bit more idiomatic 
might help.
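
For what it's worth, a minimal driver like the one below should be 
enough to exercise it. The grid size and the fill values are 
placeholders I made up for illustration, not the ones from your pastebin:

void main() {
    enum size_t N = 128;              // placeholder grid size
    auto neq = new float[N * N * 9];
    auto ux  = new float[N * N];
    auto uy  = new float[N * N];
    auto rho = new float[N * N];
    auto ex  = new float[9];
    auto ey  = new float[9];
    auto w   = new float[9];

    // Dummy data purely to exercise the loops.
    ux[]  = 0.1f;
    uy[]  = 0.2f;
    rho[] = 1.0f;
    ex[]  = 1.0f;
    ey[]  = 1.0f;
    w[]   = 1.0f / 9.0f;

    compute_neq(neq, ux, uy, rho, ex, ey, w, N);
}

Comparing the assembly of both versions (e.g. via -output-s) would tell 
you whether it actually changes the codegen.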

