SIMD implementation of dot-product. Benchmarks

Sat Aug 17 22:31:25 PDT 2013

On Sunday, 18 August 2013 at 05:26:00 UTC, Manu wrote:
> movups is not good. It'll be a lot faster (and portable) if you use movaps.
>
> Process looks something like:
>   * do the first few from a[0] until a's alignment interval as scalar
>   * load the left of b's aligned pair
>   * loop for each aligned vector in a
>     - load a[n..n+4] aligned
>     - load the right of b's pair
>     - combine left~right and shift left to match elements against a
>     - left = right
>   * perform stragglers as scalar
>
> Your benchmark is probably misleading too, because I suspect you are
> passing directly alloc-ed arrays into the function (which are 16 byte
> aligned).
> movups will be significantly slower if the pointers supplied are not 16
> byte aligned.
> Also, results vary significantly between chip manufacturers and revisions.

I'll try =). Thank you very much!
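
For reference, here is a rough sketch in C of the process quoted above, as I understand it. It is not the benchmarked code: the names dot_aligned, dot_scalar and combine are made up for illustration. It assumes SSE2 (without it, everything falls through to the scalar loop), that both pointers are at least float-aligned, and that the 16-byte blocks surrounding b are readable, because the aligned loads can touch up to three floats before and after the range actually used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Plain scalar reference, used to check the vector version. */
float dot_scalar(const float *a, const float *b, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

#if defined(__SSE2__)
/* Build [left[k..3], right[0..k-1]] from two aligned blocks (k = 1..3).
   The byte shifts need compile-time constants, hence the switch; a tuned
   version would specialise the whole loop per k instead. */
static __m128 combine(__m128 left, __m128 right, unsigned k) {
    __m128i l = _mm_castps_si128(left), r = _mm_castps_si128(right);
    switch (k) {
    case 1: l = _mm_or_si128(_mm_srli_si128(l, 4),  _mm_slli_si128(r, 12)); break;
    case 2: l = _mm_or_si128(_mm_srli_si128(l, 8),  _mm_slli_si128(r, 8));  break;
    case 3: l = _mm_or_si128(_mm_srli_si128(l, 12), _mm_slli_si128(r, 4));  break;
    default: break; /* k == 0: left is already the wanted vector */
    }
    return _mm_castsi128_ps(l);
}
#endif

float dot_aligned(const float *a, const float *b, size_t n) {
    assert(((uintptr_t)a & 3) == 0 && ((uintptr_t)b & 3) == 0);
    size_t i = 0;
    float sum = 0.0f;

    /* Scalar head: advance until a + i sits on a 16-byte boundary. */
    while (i < n && ((uintptr_t)(a + i) & 15) != 0) { sum += a[i] * b[i]; ++i; }

#if defined(__SSE2__)
    if (i + 4 <= n) {
        __m128 acc = _mm_setzero_ps();
        /* k = how many floats b + i lies past its own 16-byte boundary. */
        unsigned k = (unsigned)(((uintptr_t)(b + i) & 15) / sizeof(float));
        if (k == 0) {
            /* Both streams aligned: plain aligned loads (movaps). */
            for (; i + 4 <= n; i += 4)
                acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
        } else {
            const float *blk = b + i - k;      /* aligned block holding b[i] */
            __m128 left = _mm_load_ps(blk);    /* left half of b's pair */
            for (; i + 4 <= n; i += 4) {
                blk += 4;
                __m128 right = _mm_load_ps(blk);       /* right half of the pair */
                __m128 vb = combine(left, right, k);   /* equals b[i..i+3] */
                acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i), vb));
                left = right;
            }
        }
        float lanes[4];
        _mm_storeu_ps(lanes, acc);             /* horizontal sum of the 4 lanes */
        sum += lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }
#endif

    /* Scalar tail: the stragglers. */
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```

To check the point about the benchmark, the function can be called with deliberately offset pointers (for example buf + 5 instead of buf) so that the input is not 16-byte aligned, and the result compared against dot_scalar.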