Performance issue with @fastmath and vectorization

dextorious via digitalmars-d-ldc digitalmars-d-ldc at puremagic.com
Fri Nov 11 16:03:16 PST 2016


As part of slowly learning the basics of programming in D, I 
ported some of my fluid dynamics code from C++ to D and quickly 
noticed a rather severe performance degradation by a factor of 
2-3x. I've narrowed it down to a simple representative benchmark 
of virtually identical C++ and D code.

The D version: http://pastebin.com/Rs9CUA5j
The C++ code:  http://pastebin.com/XzStHXA2

I compile the D code using the latest beta release on GitHub, 
with the compiler switches -release -O5 -mcpu=haswell 
-boundscheck=off. The C++ version is compiled with Clang 3.9.0 
using the switches -std=c++14 -Ofast -fno-exceptions -fno-rtti 
-flto -ffast-math -march=native, which is my usual configuration 
for numerical code.
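
Concretely, the two invocations look roughly like this (the 
source and output file names are placeholders, not from the 
original post; the flags are the ones listed above):

```shell
# D version: LDC beta with the switches listed above
# (bench.d / bench_d are assumed names)
ldc2 -release -O5 -mcpu=haswell -boundscheck=off bench.d -of=bench_d

# C++ version: Clang 3.9.0 with the usual fast-math configuration
clang++ -std=c++14 -Ofast -fno-exceptions -fno-rtti -flto \
        -ffast-math -march=native bench.cpp -o bench_cpp
```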

On my Haswell i7-4710HQ machine the C++ version runs in 
~10 ms/iteration while the D code takes ~25 ms. Comparing profiler 
output with the generated assembly quickly reveals the 
reason: while Clang fully unrolls the inner loop and uses FMA 
instructions wherever possible, the inner loop assembly produced 
by LDC looks like this:

   0.24 │6c0:   vmovss (%r15,%rbp,4),%xmm4
   1.03 │       vmovss (%r12,%rbp,4),%xmm5
   3.51 │       add    $0x4,%rdi
   6.96 │       add    $0x4,%rax
   1.04 │6d4:   vmulss (%rax,%rcx,1),%xmm4,%xmm4
   4.66 │       vmulss (%rax,%rdx,1),%xmm5,%xmm5
   8.44 │       vaddss %xmm4,%xmm5,%xmm4
   1.09 │       vmulss %xmm0,%xmm4,%xmm5
   3.73 │       vmulss %xmm4,%xmm5,%xmm4
   7.48 │       vsubss %xmm3,%xmm4,%xmm4
   1.13 │       vmulss %xmm1,%xmm4,%xmm4
   2.00 │       vaddss %xmm2,%xmm5,%xmm5
   3.46 │       vmovss 0x0(%r13,%rbp,4),%xmm6
   7.85 │       vmulss (%rax,%rsi,1),%xmm6,%xmm6
   2.50 │       vaddss %xmm4,%xmm5,%xmm4
   6.49 │       vmulss %xmm4,%xmm6,%xmm4
  25.48 │       vmovss %xmm4,(%rdi)
   8.26 │       cmp    $0x20,%rax
   0.00 │     ↑ jne    6c0

Am I doing something blatantly wrong here, or have I run into a 
compiler limitation? Is there anything short of using intrinsics 
or calling C/C++ code that I can do to reach performance parity?

Also, while on the subject, is there a way to force LDC to apply 
the relaxed floating point model to the entire program, rather 
than individual functions (the equivalent of --fast-math)?
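
For reference, the per-function mechanism I mean is LDC's 
@fastmath attribute from ldc.attributes, which applies LLVM's 
fast-math flags to a single function (minimal sketch; the 
question above is whether something equivalent can be enabled 
program-wide):

```d
import ldc.attributes;

// @fastmath sets LLVM's fast-math flags for this function only,
// permitting reassociation and FMA contraction much like Clang's
// -ffast-math does for a whole C++ translation unit.
@fastmath
float dot(const float[] a, const float[] b)
{
    float s = 0.0f;
    foreach (i; 0 .. a.length)
        s += a[i] * b[i];
    return s;
}
```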

