[Issue 16605] New: core.simd generates slow/irrelevant code

Sat Oct 8 03:04:44 PDT 2016

https://issues.dlang.org/show_bug.cgi?id=16605

          Issue ID: 16605
           Summary: core.simd generates slow/irrelevant code
           Product: D
           Version: D2
          Hardware: x86_64
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P1
         Component: dmd
          Assignee: nobody at puremagic.com
          Reporter: malte.kiessling at mkalte.me

I tried working with core.simd. I noticed that (at least for trivial operations
like +=, *= etc.) the generated code is rather slow (slower than without SSE
instructions!). I used asm.dlang.org (with the newest dmd) to get the results
below.

This code:  

****
import core.simd;

void doStuff()
{
  float4 x = [1.0,0.4,1234.0,124.0];
  float4 y = [1.0,0.4,1234.0,124.0];
  float4 z = [1.0,0.4,1234.0,123.0];
  for(long i = 0; i<1_000_000; i++) {
    x += y;
    x += z;
    z += x;
  }
}
****

Results in the following assembly (I only pasted the function):
****
void example.doStuff():
 push   rbp
 mov    rbp,rsp
 sub    rsp,0x40
 movaps xmm0,XMMWORD PTR [rip+0x0]        # f <void example.doStuff()+0xf>
 movaps XMMWORD PTR [rbp-0x40],xmm0
 movaps xmm1,XMMWORD PTR [rip+0x0]        # 1a <void example.doStuff()+0x1a>
 movaps XMMWORD PTR [rbp-0x30],xmm1
 movaps xmm2,XMMWORD PTR [rip+0x0]        # 25 <void example.doStuff()+0x25>
 movaps XMMWORD PTR [rbp-0x20],xmm2
 mov    QWORD PTR [rbp-0x10],0x0

 cmp    QWORD PTR [rbp-0x10],0xf4240

 jge    6e <void example.doStuff()+0x6e>
 movaps xmm3,XMMWORD PTR [rbp-0x30]
 movaps xmm4,XMMWORD PTR [rbp-0x40]
 addps  xmm4,xmm3
 movaps XMMWORD PTR [rbp-0x40],xmm4
 movaps xmm0,XMMWORD PTR [rbp-0x20]
 movaps xmm1,XMMWORD PTR [rbp-0x40]
 addps  xmm1,xmm0
 movaps XMMWORD PTR [rbp-0x40],xmm1
 movaps xmm2,XMMWORD PTR [rbp-0x40]
 movaps xmm3,XMMWORD PTR [rbp-0x20]
 addps  xmm3,xmm2
 movaps XMMWORD PTR [rbp-0x20],xmm3
 inc    QWORD PTR [rbp-0x10]
 jmp    31 <void example.doStuff()+0x31>
 leave  
 ret    
****

The most important thing here is in the body of the for-loop:
****
    x += y;
    x += z;
    z += x;
****

Becomes

****
 movaps xmm3,XMMWORD PTR [rbp-0x30]
 movaps xmm4,XMMWORD PTR [rbp-0x40]
 addps  xmm4,xmm3
 movaps XMMWORD PTR [rbp-0x40],xmm4
 movaps xmm0,XMMWORD PTR [rbp-0x20]
 movaps xmm1,XMMWORD PTR [rbp-0x40]
 addps  xmm1,xmm0
 movaps XMMWORD PTR [rbp-0x40],xmm1
 movaps xmm2,XMMWORD PTR [rbp-0x40]
 movaps xmm3,XMMWORD PTR [rbp-0x20]
 addps  xmm3,xmm2
 movaps XMMWORD PTR [rbp-0x20],xmm3
****

Instead of
****
addps xmm0,xmm1
addps xmm0,xmm2
addps xmm2,xmm0
****

So the results of the calculation are written back to memory in every loop
iteration instead of being loaded into the xmm registers before the loop and
stored back afterwards.
Also, at the beginning the values of the float4 variables are loaded into
xmm0-2. Instead of being reused inside the loop, these registers are ignored
there and only serve to copy the values onto the stack.

As a result, the generated code runs slower than the equivalent manual
operations on an array, instead of providing a significant speedup.
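
A minimal sketch of such a comparison might look like the following. It is an
illustration only, not the original measurement: the function names simdLoop
and arrayLoop are made up here, and it assumes a dmd release that ships
std.datetime.stopwatch. Timings will vary with compiler version and flags.

```d
import core.simd;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

enum long iterations = 1_000_000;

// Same loop as in the report, using the core.simd vector type.
float4 simdLoop()
{
    float4 x = [1.0, 0.4, 1234.0, 124.0];
    float4 y = [1.0, 0.4, 1234.0, 124.0];
    float4 z = [1.0, 0.4, 1234.0, 123.0];
    foreach (i; 0 .. iterations)
    {
        x += y;
        x += z;
        z += x;
    }
    return z;
}

// Hypothetical baseline: the same arithmetic on plain static arrays,
// written with D array operations instead of float4.
float[4] arrayLoop()
{
    float[4] x = [1.0, 0.4, 1234.0, 124.0];
    float[4] y = [1.0, 0.4, 1234.0, 124.0];
    float[4] z = [1.0, 0.4, 1234.0, 123.0];
    foreach (i; 0 .. iterations)
    {
        x[] += y[];
        x[] += z[];
        z[] += x[];
    }
    return z;
}

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    auto a = simdLoop();
    auto tSimd = sw.peek();

    sw.reset();
    auto b = arrayLoop();
    auto tArray = sw.peek();

    writefln("core.simd: %s, float[4]: %s", tSimd, tArray);
    // Print one lane of each result so the loops are not discarded as dead code.
    writefln("results: %s %s", a.array[0], b[0]);
}
```

Both functions return their result so that the work cannot be trivially
removed; the values themselves overflow quickly and are not meaningful.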

--