Suboptimal dynamic array operands

Mon Jun 21 02:41:43 UTC 2021

On Monday, 21 June 2021 at 01:12:33 UTC, z wrote:
> When performing `a[] *op*= b[]` or `foreach(i, aa; a){a[i] -= 
> b[i]}` operations, LDC generates slower code than it should.
>
> The generated code appears to always be a switch loop which 
> operates on packets of 32, 4 or 1 values(the size of the 
> packets varies program to program) and jumps to the appropriate 
> case depending on the remaining number of values to operate on.
> The code for the 32 and 1-sized packets is ok in my program, 
> but the middle in-between size(4 here, although i've seen it do 
> it with 8) always uses unrolled `v*op*ss` instead of the packed 
> versions(`VSUBPS` here).

It looks like both loop unrolling and auto-vectorization expect a
higher iteration/element count by default, and LDC currently
doesn't have a way to fine-tune these parameters on a
per-function/loop basis (e.g., via pragmas); they can only be
changed through global LLVM command-line options.
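
For concreteness, the two loop forms from the quoted post can be
written as below. This is only a sketch: the function names
`subSliceOp` and `subForeach` are made up for illustration, and
`float` is assumed as the element type because the original post
mentions `VSUBPS`.

```d
// Hypothetical helper names, for illustration only.

// Array-wise form: subtracts b from a element by element.
// Both slices must have the same length.
void subSliceOp(float[] a, const(float)[] b)
{
    a[] -= b[];
}

// Explicit loop form of the same operation.
void subForeach(float[] a, const(float)[] b)
{
    assert(a.length == b.length);
    foreach (i, ref x; a)
        x -= b[i];
}
```

Both functions receive plain slices, so the optimizer knows
neither the element count nor whether `a` and `b` overlap.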

Using @restrict to tell the optimizer that the slices don't
overlap doesn't help much here either (the slices are opaque to
the optimizer because the GC allocation is an opaque druntime
function call).

Using vector types explicitly improves things but imposes 
restrictions on lengths and alignment.
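
A minimal sketch of what "using vector types explicitly" could
look like, assuming a target where `core.simd` provides `float4`
(e.g. x86_64). The function name `subVec` is hypothetical and this
is not necessarily what the godbolt link below does.

```d
import core.simd;

// Hypothetical helper: subtracts b from a, four floats per element.
// Taking float4[] means the caller must supply data whose length is
// a multiple of 4 floats and which is aligned for float4.
void subVec(float4[] a, float4[] b)
{
    assert(a.length == b.length);
    foreach (i, ref v; a)
        v = v - b[i]; // a single packed subtraction per float4
}
```

The restrictions show up in the signature: a `float[]` whose
length is not a multiple of 4, or whose data is not suitably
aligned, cannot simply be passed as a `float4[]`, so the caller
has to handle the remainder elements separately.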

See https://d.godbolt.org/z/r746o3Ya5 for boilerplate-free 
assembly.