Performance issue with @fastmath and vectorization
dextorious via digitalmars-d-ldc
digitalmars-d-ldc at puremagic.com
Fri Nov 11 16:03:16 PST 2016
As part of slowly learning the basics of programming in D, I
ported some of my fluid dynamics code from C++ to D and quickly
noticed a rather severe performance degradation of 2-3x. I've
narrowed it down to a simple representative benchmark of
virtually identical C++ and D code.
The D version: http://pastebin.com/Rs9CUA5j
The C++ code: http://pastebin.com/XzStHXA2
I compile the D code with the latest LDC beta release from
GitHub, using the switches -release -O5 -mcpu=haswell
-boundscheck=off. The C++ version is compiled with Clang 3.9.0
using -std=c++14 -Ofast -fno-exceptions -fno-rtti -flto
-ffast-math -march=native, which is my usual configuration
for numerical code.
On my Haswell i7-4710HQ machine the C++ version runs in
~10ms/iteration while the D code takes 25ms. Comparing profiler
output with the generated assembly code quickly reveals the
reason: while Clang fully unrolls the inner loop and uses FMA
instructions wherever possible, the inner loop assembly produced
by LDC looks like this:
0.24 │6c0: vmovss (%r15,%rbp,4),%xmm4
1.03 │ vmovss (%r12,%rbp,4),%xmm5
3.51 │ add $0x4,%rdi
6.96 │ add $0x4,%rax
1.04 │6d4: vmulss (%rax,%rcx,1),%xmm4,%xmm4
4.66 │ vmulss (%rax,%rdx,1),%xmm5,%xmm5
8.44 │ vaddss %xmm4,%xmm5,%xmm4
1.09 │ vmulss %xmm0,%xmm4,%xmm5
3.73 │ vmulss %xmm4,%xmm5,%xmm4
7.48 │ vsubss %xmm3,%xmm4,%xmm4
1.13 │ vmulss %xmm1,%xmm4,%xmm4
2.00 │ vaddss %xmm2,%xmm5,%xmm5
3.46 │ vmovss 0x0(%r13,%rbp,4),%xmm6
7.85 │ vmulss (%rax,%rsi,1),%xmm6,%xmm6
2.50 │ vaddss %xmm4,%xmm5,%xmm4
6.49 │ vmulss %xmm4,%xmm6,%xmm4
25.48 │ vmovss %xmm4,(%rdi)
8.26 │ cmp $0x20,%rax
0.00 │ ↑ jne 6c0
Am I doing something blatantly wrong here or have I run into a
compiler limitation? Is there anything short of using intrinsics
or calling C/C++ code I can do here to get to performance parity?
Also, while on the subject, is there a way to force LDC to apply
the relaxed floating point model to the entire program, rather
than individual functions (the equivalent of --fast-math)?
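For reference, I'm currently applying the relaxed model per
function via the @fastmath attribute from ldc.attributes, roughly
like this (a simplified stand-in for my actual kernel):

```d
import ldc.attributes : fastmath;

// @fastmath attaches LLVM's fast-math flags to the floating-point
// operations inside this one annotated function only.
@fastmath
void axpy(float a, float[] x, float[] y)
{
    foreach (i; 0 .. y.length)
        y[i] += a * x[i];   // candidate for FMA contraction
}
```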