Good dotProduct

Tue Jun 29 18:17:09 PDT 2010

bearophile wrote:
> A version for floats. A version for reals can't be done with SSE* registers.
> This loop is unrolled two times, and each XMM register holds 4 floats, so it performs 8 mul+add per loop iteration. Again, this code is slower for shorter arrays, but not by much.
> 
> A version of the code with no unrolling (that performs only 4 mul+add per loop iteration) is a little better for shorter arrays (to create it you just need to change UNROLL_MASK to 0b11, remove all the operations on XMM2 and XMM3 and add only 16 to EDX each loop).
> 
> The asserts assert((cast(size_t)... can be replaced by a loop that performs the unaligned muls+adds and then changes len, a_ptr and b_ptr to the remaining aligned ones.

You already have a loop at the end that takes care of the stray 
elements. Why not move it to the beginning to take care of the stray 
elements _and_ unaligned elements in one shot?
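
A minimal sketch of that idea in portable C (not code from this thread). The names dot_product, scalar_dot, ALIGN and BLOCK are hypothetical, and a plain inner loop stands in for the unrolled SSE body. It assumes both arrays share the same offset from a 16-byte boundary and falls back to scalar code otherwise. One caveat the sketch makes visible: peeling at the front fixes where the aligned body starts, so elements left over after the last full block of 8 still sit at the end, and the same scalar routine is reused for them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ALIGN = 16, BLOCK = 8 };  /* 16-byte boundary, 2 x 4 floats per pass */

/* Plain scalar mul+add for whatever the aligned body cannot take. */
static float scalar_dot(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hypothetical helper: scalar prologue, aligned body, scalar remainder. */
float dot_product(const float *a, const float *b, size_t len)
{
    assert(len == 0 || (a != NULL && b != NULL));

    /* Number of floats to peel before a reaches a 16-byte boundary. */
    size_t mis = (uintptr_t)a % ALIGN;
    size_t head = mis ? (ALIGN - mis) / sizeof(float) : 0;
    if (head > len)
        head = len;

    /* Assumption: a and b must become aligned together; if not, stay scalar. */
    if ((uintptr_t)(a + head) % ALIGN != 0 || (uintptr_t)(b + head) % ALIGN != 0)
        return scalar_dot(a, b, len);

    /* Prologue: the unaligned leading elements. */
    float sum = scalar_dot(a, b, head);
    a += head;
    b += head;
    len -= head;

    /* Aligned body in blocks of 8; an SSE loop would replace the inner loop. */
    size_t body = len - len % BLOCK;
    float acc[BLOCK] = {0};
    for (size_t i = 0; i < body; i += BLOCK)
        for (size_t j = 0; j < BLOCK; j++)
            acc[j] += a[i + j] * b[i + j];
    for (size_t j = 0; j < BLOCK; j++)
        sum += acc[j];

    /* Remainder: fewer than 8 elements after the last full block. */
    return sum + scalar_dot(a + body, b + body, len - body);
}
```

With this shape the alignment asserts are no longer needed: a call such as dot_product(x + 1, y + 1, n) peels the leading floats and then runs the aligned body, while mismatched offsets such as (x + 1, y + 2) simply take the scalar path.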

Andrei