Optimizing for SIMD: best practices? (i.e. what features are allowed?)

tsbockman thomas.bockman at gmail.com
Sun Mar 7 23:06:30 UTC 2021


On Sunday, 7 March 2021 at 18:00:57 UTC, z wrote:
> On Friday, 26 February 2021 at 03:57:12 UTC, tsbockman wrote:
>>>>   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
>>>>       distance += a[i].abs;//abs required by the caller
>>
>> (a * a) above is always positive for real numbers. You don't 
>> need the call to abs unless you're trying to guarantee that 
>> even NaN values will have a clear sign bit.
>>
> I do not know why but the caller's performance nosedives

My way is definitely (slightly) better; something is going wrong 
in either the caller or the optimizer. Show me the code for the 
caller and maybe I can figure it out.
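
For reference, here's roughly the shape I had in mind (a sketch 
only; I'm assuming `a` is a small static array of per-component 
differences as floats, which may not match your actual types):

// Sketch: accumulate a squared Euclidean distance.
float distanceSquared(const float[3] a)
{
    float distance = 0;
    static foreach (size_t i; 0 .. a.length)
    {
        // x * x is already non-negative, so no .abs is needed here.
        distance += a[i] * a[i];
    }
    return distance;
}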

> whenever there is no .abs at this particular line. (There's a 
> 3x difference, no joke.)

Perhaps the compiler is performing a value range propagation 
(VRP) based optimization in the caller, but isn't smart enough to 
figure out that the value is already always positive without the 
`abs` call? I've run into that specific problem before.
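
To illustrate the kind of thing I mean (a made-up example, not 
your code): with an `abs` in view, VRP can prove the value is 
non-negative and delete a later sign check, but it may fail to 
prove the same thing about `x * x` once vectorization is 
involved:

import std.math : abs; // only needed for the .abs variant below

float clampNegativeToZero(float x)
{
    float d = x * x;         // mathematically >= 0 (or NaN), but the
                             // optimizer may not track that
    //float d = (x * x).abs; // with abs, d >= 0 is trivially provable
    return d < 0 ? 0 : d;    // a check the optimizer can only remove
                             // when it knows d can't be negative
}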

Alternatively, sometimes trivial changes to the code that 
*shouldn't* matter make the difference between hitting a smart 
path in the optimizer and a dumb one. Automatic SIMD 
vectorization is quite sensitive and temperamental.

Either way, the problem can be fixed by figuring out what 
optimization the compiler is doing when it knows that distance is 
positive, and just doing it manually.
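
I don't know what your caller actually does with the distance, 
but as a made-up example: if it ends up comparing the distance 
against a threshold, you can apply that transformation by hand 
instead of hoping the optimizer proves non-negativity for you:

import std.math : sqrt;

// Hypothetical caller logic:
bool within(float distanceSquared, float threshold)
{
    return sqrt(distanceSquared) <= threshold;
}

// The same test done manually (assuming threshold >= 0); now it no
// longer matters whether the compiler can prove distanceSquared >= 0.
bool withinManual(float distanceSquared, float threshold)
{
    return distanceSquared <= threshold * threshold;
}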

> Same for assignment instead of addition, but with a 2x 
> difference instead.

Did you fix the NaN bug I pointed out earlier? More generally, 
are you actually verifying the correctness of the results in any 
way for each alternative implementation? You can sometimes get 
big speedups from buggy code when the compiler realizes that a 
later logic error makes earlier code irrelevant, but that doesn't 
mean the buggy code is better...
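
As a quick sanity check, something like this (a sketch; 
`distanceFast` and `distanceRef` stand in for your optimized 
version and a straightforward scalar reference):

import std.math : approxEqual;
import std.random : uniform;
import std.stdio : writefln;

void verify(alias distanceFast, alias distanceRef)()
{
    foreach (_; 0 .. 1_000)
    {
        float[3] a;
        foreach (ref x; a)
            x = uniform(-1000.0f, 1000.0f);

        immutable fast = distanceFast(a); // optimized candidate
        immutable slow = distanceRef(a);  // naive reference
        if (!approxEqual(fast, slow))
            writefln("mismatch for %s: %s vs %s", a, fast, slow);
    }
}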

