SIMD implementation of dot-product. Benchmarks

Sat Aug 17 21:39:09 PDT 2013

On Sunday, 18 August 2013 at 01:53:53 UTC, Manu wrote:
> It doesn't look like you account for alignment.
> This is basically not-portable (I doubt unaligned loads in this 
> context are
> faster than performing scalar operations), and possibly 
> inefficient on x86
> too.

dotProduct uses unaligned loads (__builtin_ia32_loadups256, 
__builtin_ia32_loadupd256), and it is up to 21 times faster than 
the trivial scalar version.

Why are unaligned loads non-portable and inefficient?

> To make it account for potentially random alignment will be 
> awkward, but it
> might be possible to do efficiently.

Did you mean using unaligned loads, or preparing the data for 
aligned loads at the beginning of the function?
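
To make sure we are talking about the same thing, here is a rough 
sketch of how I understand "preparing the data for aligned loads". 
It is plain C, not the dotProduct code from the benchmark, and the 
names dot_scalar, dot_peeled, ALIGN and LANES are made up for this 
example. A scalar head loop runs until the first array reaches a 
32-byte boundary, the body then handles 8 floats per step (the place 
where an aligned 256-bit load could be used), and a scalar tail 
handles the remainder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants: 32 bytes = 8 floats, one 256-bit register. */
enum { ALIGN = 32, LANES = 8 };

/* Trivial scalar reference version. */
float dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Dot product with a scalar head that peels elements until `a` is aligned. */
float dot_peeled(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;

    /* Head: scalar steps until a + i sits on a 32-byte boundary. */
    while (i < n && (uintptr_t)(a + i) % ALIGN != 0) {
        sum += a[i] * b[i];
        i++;
    }
    assert(i == n || (uintptr_t)(a + i) % ALIGN == 0);

    /* Body: 8 floats per step. a + i is aligned here; b + i may not be. */
    float lane[LANES] = {0};
    for (; i + LANES <= n; i += LANES)
        for (size_t k = 0; k < LANES; k++)
            lane[k] += a[i + k] * b[i + k];

    /* Horizontal sum of the per-lane partial sums. */
    for (size_t k = 0; k < LANES; k++)
        sum += lane[k];

    /* Tail: fewer than 8 elements left. */
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Note that only one of the two arrays can be aligned this way. Unless 
both pointers have the same offset modulo 32, the second array would 
still need unaligned loads in the body.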