[Issue 16605] New: core.simd generates slow/irrelevant code
via Digitalmars-d-bugs
digitalmars-d-bugs at puremagic.com
Sat Oct 8 03:04:44 PDT 2016
https://issues.dlang.org/show_bug.cgi?id=16605
Issue ID: 16605
Summary: core.simd generates slow/irrelevant code
Product: D
Version: D2
Hardware: x86_64
OS: Linux
Status: NEW
Severity: enhancement
Priority: P1
Component: dmd
Assignee: nobody at puremagic.com
Reporter: malte.kiessling at mkalte.me
I tried working with core.simd. I noticed that (at least for trivial operations
like +=, *= etc.) the generated code is rather slow (slower than without SSE
instructions!). I used asm.dlang.org with the newest dmd to get the results
below.
This code:
****
import core.simd;
void doStuff()
{
    float4 x = [1.0,0.4,1234.0,124.0];
    float4 y = [1.0,0.4,1234.0,124.0];
    float4 z = [1.0,0.4,1234.0,123.0];
    for(long i = 0; i<1_000_000; i++) {
        x += y;
        x += z;
        z += x;
    }
}
****
Results in the following assembly (I only pasted the function):
****
void example.doStuff():
push rbp
mov rbp,rsp
sub rsp,0x40
movaps xmm0,XMMWORD PTR [rip+0x0] # f <void example.doStuff()+0xf>
movaps XMMWORD PTR [rbp-0x40],xmm0
movaps xmm1,XMMWORD PTR [rip+0x0] # 1a <void example.doStuff()+0x1a>
movaps XMMWORD PTR [rbp-0x30],xmm1
movaps xmm2,XMMWORD PTR [rip+0x0] # 25 <void example.doStuff()+0x25>
movaps XMMWORD PTR [rbp-0x20],xmm2
mov QWORD PTR [rbp-0x10],0x0
cmp QWORD PTR [rbp-0x10],0xf4240
jge 6e <void example.doStuff()+0x6e>
movaps xmm3,XMMWORD PTR [rbp-0x30]
movaps xmm4,XMMWORD PTR [rbp-0x40]
addps xmm4,xmm3
movaps XMMWORD PTR [rbp-0x40],xmm4
movaps xmm0,XMMWORD PTR [rbp-0x20]
movaps xmm1,XMMWORD PTR [rbp-0x40]
addps xmm1,xmm0
movaps XMMWORD PTR [rbp-0x40],xmm1
movaps xmm2,XMMWORD PTR [rbp-0x40]
movaps xmm3,XMMWORD PTR [rbp-0x20]
addps xmm3,xmm2
movaps XMMWORD PTR [rbp-0x20],xmm3
inc QWORD PTR [rbp-0x10]
jmp 31 <void example.doStuff()+0x31>
leave
ret
****
The most important thing here is in the body of the for-loop:
****
x += y;
x += z;
z += x;
****
Becomes
****
movaps xmm3,XMMWORD PTR [rbp-0x30]
movaps xmm4,XMMWORD PTR [rbp-0x40]
addps xmm4,xmm3
movaps XMMWORD PTR [rbp-0x40],xmm4
movaps xmm0,XMMWORD PTR [rbp-0x20]
movaps xmm1,XMMWORD PTR [rbp-0x40]
addps xmm1,xmm0
movaps XMMWORD PTR [rbp-0x40],xmm1
movaps xmm2,XMMWORD PTR [rbp-0x40]
movaps xmm3,XMMWORD PTR [rbp-0x20]
addps xmm3,xmm2
movaps XMMWORD PTR [rbp-0x20],xmm3
****
Instead of
****
addps xmm0,xmm1
addps xmm0,xmm2
addps xmm2,xmm0
****
So the results of the calculation are written back to memory in every loop
iteration, instead of being loaded into the xmm registers before the loop and
stored back afterwards.
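For comparing code generation, a variant that returns its result may be useful. This is only a sketch: `doStuffResult` and its `iterations` parameter are hypothetical names, not part of the original report. Since `doStuff` discards x, y and z, a compiler is free to treat the loop as dead code; returning z keeps the arithmetic observable. The listing above also keeps the loop counter on the stack (`inc QWORD PTR [rbp-0x10]`), and the report does not say which compiler flags were used, so it may be worth comparing the output with and without `-O`.

```d
import core.simd;

// Hypothetical variant of doStuff: same arithmetic as in the report,
// but the final value of z is returned so the loop has a visible result.
// Assumes a target where core.simd provides float4 (e.g. dmd on x86_64).
float4 doStuffResult(long iterations)
{
    float4 x = [1.0f, 0.4f, 1234.0f, 124.0f];
    float4 y = [1.0f, 0.4f, 1234.0f, 124.0f];
    float4 z = [1.0f, 0.4f, 1234.0f, 123.0f];
    for (long i = 0; i < iterations; i++)
    {
        x += y; // ideally a single addps on registers
        x += z;
        z += x;
    }
    return z;
}
```

The individual lanes of the result can be read through the `.array` property, for example `doStuffResult(1).array[0]`.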
Also, at the beginning the values of the float4 variables are loaded into
xmm0-xmm2. Instead of being used inside the loop, these register contents are
ignored there and only serve for the initial copy to memory.
The result is that the generated code runs slower than the same operations
written manually on an array, instead of giving a significant speedup.
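The report does not show the manual array version it compares against. A minimal sketch of what such a baseline might look like follows; `doStuffScalar` and `timeScalar` are hypothetical names, and the array-wise `+=` is an assumption about how the reporter wrote it. The timing helper uses `MonoTime` from druntime's `core.time`.

```d
import core.time : Duration, MonoTime;

// Hypothetical scalar baseline: the same arithmetic on plain float[4]
// arrays, using D's array-wise operations instead of core.simd.
float[4] doStuffScalar(long iterations)
{
    float[4] x = [1.0f, 0.4f, 1234.0f, 124.0f];
    float[4] y = [1.0f, 0.4f, 1234.0f, 124.0f];
    float[4] z = [1.0f, 0.4f, 1234.0f, 123.0f];
    for (long i = 0; i < iterations; i++)
    {
        x[] += y[];
        x[] += z[];
        z[] += x[];
    }
    return z;
}

// Wall-clock time of one call, measured with the monotonic clock.
Duration timeScalar(long iterations)
{
    auto start = MonoTime.currTime;
    auto result = doStuffScalar(iterations);
    return MonoTime.currTime - start;
}
```

Timing a float4 version the same way, with identical compiler flags for both, would give a like-for-like comparison.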