Good dotProduct
bearophile
bearophileHUGS at lycos.com
Tue Jun 29 17:50:16 PDT 2010
A version for floats. A version for reals can't be done with SSE* registers.
This loop is unrolled two times, and each SSE2 register keeps 4 floats, so it performs 8 mul+add each cycle. Again this code is slower for shorter arrays, but not much.
A version of the code with no unrolling (that performs only 4 mul+add each cycle) is a little better for shorter arrays (to create it you just need to change UNROLL_MASK to 0b11, remove all the operations on XMM2 and XMM3 and add only 16 to EDX each loop).
The asserts assert((cast(size_t)... can be replaced by a loop that performs the unaligned muls+adds and then changes len, a_ptr and b_ptr to the remaining aligned ones.
float dot(float[] a, float[] b) {
enum size_t UNROLL_MASK = 0b111;
assert(a.length == b.length, "dot(): the two array lengths differ.");
typeof(a[0]) tot = void;
auto a_ptr = a.ptr;
auto b_ptr = b.ptr;
assert((cast(size_t)a_ptr & cast(size_t)0b1111) == 0,
"dot(): the first array is not aligned to 16 bytes.");
assert((cast(size_t)b_ptr & cast(size_t)0b1111) == 0,
"dot(): the second array is not aligned to 16 bytes.");
size_t len = (a.length & (~UNROLL_MASK)) * a[0].sizeof;
if (len) {
asm {
mov EAX, a_ptr;
mov ECX, b_ptr;
mov EDI, len;
xorps XMM0, XMM0;
xorps XMM3, XMM3;
xor EDX, EDX;
align 8;
LOOP_START:
movaps XMM1, [EAX + EDX];
mulps XMM1, [ECX + EDX];
movaps XMM2, [EAX + EDX + 16];
mulps XMM2, [ECX + EDX + 16];
add EDX, 32;
addps XMM0, XMM1;
cmp EDI, EDX;
addps XMM3, XMM2;
jne LOOP_START;
addps XMM0, XMM3;
// XMM0[0] = XMM0[0] + XMM0[1] + XMM0[2] + XMM0[3]
movhlps XMM1, XMM0;
addps XMM0, XMM1;
pshufd XMM1, XMM0, 0b01_01_01_01;
addss XMM0, XMM1;
movss tot, XMM0;
}
} else
tot = 0.0;
size_t left = a.length & UNROLL_MASK;
for (size_t i = left; i > 0; i--)
tot += a[$ - i] * b[$ - i];
return tot;
}
On the same Celeron it takes about 0.88 ticks for each mul+add when the two arrays are 1400 floats long (they fit in L1 cache, if they gets longer and don't fit in L1 it slows down a little). I think better code or better CPUs can go as low as 0.5, but I and/or my CPU aren't that good yet. Its usage of cache lines is not optimal, I think. I don't think cache prefetching instructions can help this code.
Bye,
bearophile
More information about the Digitalmars-d
mailing list