Suboptimal dynamic array operands
z
z at z.com
Mon Jun 21 01:12:33 UTC 2021
For reference: https://run.dlang.io/is/FULu3x (select LDC, click
ASM, then press Ctrl+F and search for "main" to find the relevant
code). The compiler options are `-O -mcpu=native
-enable-no-infs-fp-math -enable-unsafe-fp-math -enable-ipra
-tailcallopt -release`.
When performing `a[] *op*= b[]` or
`foreach(i, aa; a){a[i] -= b[i];}` operations, LDC generates
slower code than it should.
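The two forms mentioned above could be isolated like this for inspection in the ASM view. This is only a minimal sketch, not the code behind the link: the function names `subArrayOp` and `subForeach` are made up for illustration, and `float` is assumed as the element type because the disassembly uses `vsubss`.

```d
/// Form 1: array-operation syntax (`a[] op= b[]`), here with `-=`.
/// Hypothetical helper name, not taken from the linked program.
void subArrayOp(float[] a, const(float)[] b)
{
    a[] -= b[];
}

/// Form 2: explicit indexed foreach, mirroring the loop in the post.
/// Hypothetical helper name, not taken from the linked program.
void subForeach(float[] a, const(float)[] b)
{
    foreach (i, aa; a)
        a[i] -= b[i];
}
```

Calling either function with two equal-length `float[]` arguments subtracts `b` from `a` in place, so both should in principle lower to the same loop.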
The generated code always appears to be a loop around a switch
that operates on packets of 32, 4 or 1 values (the packet sizes
vary from program to program) and jumps to the appropriate case
depending on the number of values remaining.
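As a source-level model of that structure (a rough sketch only, since the real dispatch is a compiler artifact and its exact shape may differ), the loop behaves roughly as below. The name `subBlocked` is hypothetical, and the packet sizes 32, 4 and 1 are simply the ones observed in this program.

```d
/// Models the observed dispatch: process the largest packet that still
/// fits (32, then 4, then 1) until no elements remain.
/// Hypothetical illustration, not the compiler's actual output.
void subBlocked(float[] a, const(float)[] b)
{
    assert(a.length == b.length, "operands must have equal length");
    size_t i = 0;
    while (i < a.length)
    {
        immutable size_t remaining = a.length - i;
        // Pick the packet size the way the switch picks its case.
        immutable size_t step = remaining >= 32 ? 32 : (remaining >= 4 ? 4 : 1);
        // One packet; in the generated code this body is unrolled.
        foreach (j; i .. i + step)
            a[j] -= b[j];
        i += step;
    }
}
```

For an array of 39 elements this model takes one 32-wide step, one 4-wide step and three single steps.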
The code for the 32- and 1-sized packets is OK in my program, but
the intermediate size (4 here, although I've seen it happen with
8) always uses unrolled scalar `v*op*ss` instructions instead of
the packed versions (`VSUBPS` here).
In my testing, naively patching the code with the appropriate
SIMD equivalent through a debugger and jumping to the end of the
switch case gives an observable performance gain (5-10% of total
program time in the worst case, where the array's .length is < 32).
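One way the short-array case could be timed in isolation is sketched below. This is an assumption-laden illustration rather than the measurement used for the figures above: `timeShortSub` and its `reps` parameter are hypothetical, and the caller is expected to pass arrays shorter than 32 elements so that only the small-packet cases run.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;

/// Repeats the in-place subtraction `reps` times and returns the
/// elapsed time. Hypothetical helper; `a` stays owned by the caller so
/// the result remains observable after the call.
Duration timeShortSub(float[] a, const(float)[] b, size_t reps)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. reps)
        a[] -= b[];
    sw.stop();
    return sw.peek;
}
```

Comparing the returned durations for lengths just below and just above 32 should show how much the intermediate packet size costs.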
I'm not sure if this is related, but I've also seen code output
where the faulty case kept repeating redundant register loads on
every iteration, as if each one were the first.
An example:
```nasm
mov rsi, [rsp+30]
mov rdi, [rsp+78]
mov rsi, [rsi+138]
vmovss xmm0, [rdi+rdx*4]
vsubss xmm0, xmm0, [rsi+rdx*4]
vmovss [rdi+rdx*4], xmm0
; repeat this for 3 more iterations: the pointer loads are
; exactly the same, while the float accesses use an added
; +(unroll_i*float.sizeof) offset
```
So my question is: how can I get LDC/LLVM to generate the proper
code?
Thanks.
More information about the digitalmars-d-ldc
mailing list