vectorization of a simple loop -- not in DMD?

Thu Jul 14 08:59:22 UTC 2022

On Monday, 11 July 2022 at 18:15:16 UTC, Ivan Kazmenko wrote:
> Hi.
>
> I'm looking at the compiler output of DMD (-O -release), LDC 
> (-O -release), and GDC (-O3) for a simple array operation:
>
> ```
> void add1 (int [] a)
> {
>     foreach (i; 0..a.length)
>         a[i] += 1;
> }
> ```
>
> Here are the outputs: https://godbolt.org/z/GcznbjEaf
>
> From what I gather at the view linked above, DMD does not use 
> XMM registers for speedup, and does not unroll the loop either. 
>  Switching between 32bit and 64bit doesn't help either.  
> However, I recall in the past it was capable of at least some 
> of these optimizations.  So, how do I enable them for such a 
> function?
>
> Ivan Kazmenko.

No, not in DMD. DMD generates what looks like 32-bit code adapted 
to x86_64.
LDC may optimize this kind of loop with a three-way branch that 
depends on how many array elements remain. It can generate very 
good loop code (particularly when AVX-512 is available and the 
struct/data arrangement in memory is unfavorable for SIMD), but 
it can also generate very questionable code.
You may lose performance for obscure reasons, as if gnomes had 
decided to steal your precious CPU cycles. When that happens, 
there is no way to fix it other than going in manually with a 
disassembler/debugger, changing the defective optimizations in 
hot code paths to something faster, and then saving the result 
back to the executable file. (Yikes, I know.)
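
To make the "branch on how many elements remain" idea concrete, here is a rough sketch in D of the shape such code takes: a main loop over whole blocks, then a scalar tail for the leftovers. This is only an illustration, not what LDC actually emits. The names `add1Blocked` and `add1ArrayOp` and the block width of 4 are my own assumptions. The second function uses D's built-in array-wise operation, which as far as I know gives the compiler and runtime a better chance to use SIMD than the hand-written `foreach`, but check the generated code for your compiler and flags before relying on that.

```d
// Hypothetical helper: processes the array in blocks of 4, then a tail.
// The block width is an arbitrary choice for illustration.
void add1Blocked(int[] a)
{
    enum block = 4;
    size_t i = 0;

    // Main loop: whole blocks of 4 elements, which an optimizer
    // may be able to map onto a single SIMD add.
    for (; i + block <= a.length; i += block)
        a[i .. i + block] += 1;   // array-wise operation on a slice

    // Tail loop: the 0 to 3 elements that did not fill a block.
    for (; i < a.length; ++i)
        a[i] += 1;
}

// Hypothetical helper: the same operation as a whole-array expression.
void add1ArrayOp(int[] a)
{
    a[] += 1;
}

void main()
{
    auto a = [1, 2, 3, 4, 5, 6, 7];
    auto b = a.dup;

    add1Blocked(a);
    add1ArrayOp(b);

    assert(a == [2, 3, 4, 5, 6, 7, 8]);
    assert(a == b);
}
```

Comparing the output of `add1ArrayOp` with the original `add1` on godbolt for each compiler is probably the quickest way to see whether this helps in your case.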