auto vectorization notes

Mon Mar 23 18:52:16 UTC 2020

When speeds are equivalent, or very close, I usually prefer 
auto-vectorized code to explicit SIMD/__vector code as it's easier 
to read.  (On the downside, you have to guard against compiler 
code-gen performance regressions.)

One oddity I've noticed is that I sometimes need to use 
pragma(inline, *false*) in order to get LDC to "do the right 
thing". Apparently the compiler weighs the costs/benefits of 
vectorizing differently when the function is compiled standalone 
rather than inlined into its caller.
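
As a minimal sketch of that pattern: the axpy kernel below, its 
name and its signature are made up for illustration, and whether 
LDC actually vectorizes the loop depends on the compiler version 
and flags (for example -O3 -mcpu=native), so inspecting the 
generated assembly is the only real check.

```d
// Hypothetical kernel, used only to show where the pragma goes.
// pragma(inline, false) keeps the loop in its own function so the
// optimizer judges it standalone instead of as part of a caller.
pragma(inline, false)
void axpy(float[] y, const(float)[] x, float a) @safe pure nothrow @nogc
{
    assert(x.length == y.length);
    foreach (i; 0 .. y.length)
        y[i] += a * x[i];   // simple, unit-stride indexing
}

void main()
{
    auto x = [1.0f, 2.0f, 3.0f, 4.0f];
    auto y = [10.0f, 20.0f, 30.0f, 40.0f];
    axpy(y, x, 2.0f);
    assert(y == [12.0f, 24.0f, 36.0f, 48.0f]);
}
```

Comparing the assembly with and without the pragma is a quick way 
to see whether it changes anything for a given kernel.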

More widely known techniques that have gotten people over the 
serial/SIMD hump include:
  1) simplified indexing relationships
  2) known count inner loops (chunkify)
  3) static foreach blocks (manual inlining that the compiler 
     "gathers")
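
The three techniques above can be combined in one loop. The 
sketch below is illustrative only: the dot function, its name and 
the chunk width of 8 are assumptions, not something from a real 
code base, and the compiler is free to vectorize it or not.

```d
// Hypothetical chunk width; pick one that suits the target's lanes.
enum chunk = 8;

float dot(const(float)[] a, const(float)[] b) @safe pure nothrow @nogc
{
    assert(a.length == b.length);
    float[chunk] acc = 0;   // one partial sum per would-be lane
    size_t i = 0;

    // 2) known-count inner loop: process whole chunks only
    for (; i + chunk <= a.length; i += chunk)
    {
        // 3) static foreach unrolls at compile time into straight-line
        //    code, with 1) the simple index relationship i + k
        static foreach (k; 0 .. chunk)
            acc[k] += a[i + k] * b[i + k];
    }

    // Combine the partial sums.
    float sum = 0;
    static foreach (k; 0 .. chunk)
        sum += acc[k];

    // Scalar tail for the leftover elements.
    for (; i < a.length; ++i)
        sum += a[i] * b[i];
    return sum;
}

void main()
{
    float[] a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    auto b = new float[](a.length);
    b[] = 2.0f;
    assert(dot(a, b) == 110.0f);   // one full chunk plus a tail of 2
}
```

Keeping separate accumulators means the partial sums do not depend 
on each other, which is one way to give the optimizer a reduction 
it can vectorize without reordering floating point operations.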

I'd be interested to hear from others regarding their 
auto-vectorization and __vector experiences.  What has worked and 
what hasn't worked in your performance-sensitive D code?