Good dotProduct

bearophile bearophileHUGS at lycos.com
Tue Jun 29 17:50:16 PDT 2010


Here is a version for floats. A version for reals can't be done with SSE registers, since D reals are 80-bit x87 values on x86.
The loop is unrolled twice, and each XMM register holds 4 floats, so it performs 8 mul+add operations per loop iteration. As before, this code is slower for short arrays, but not by much.

A version of the code with no unrolling (that performs only 4 mul+add per iteration) is a little better for short arrays. To create it you just need to change UNROLL_MASK to 0b11, remove all the operations on XMM2 and XMM3, and add only 16 to EDX on each iteration, as sketched below.
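
For reference, the inner loop of that non-unrolled variant would look
something like this (a sketch built from the recipe above, not timed):

            align 8;
          LOOP_START:
            movaps XMM1, [EAX + EDX]; // load 4 floats of a
            mulps XMM1,  [ECX + EDX]; // multiply by 4 floats of b
            add EDX, 16;              // one XMM register = 16 bytes
            addps XMM0, XMM1;         // accumulate the 4 products
            cmp EDI, EDX;
            jne LOOP_START;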

The assert((cast(size_t)... asserts can be replaced by a scalar loop that performs the leading unaligned mul+adds and then updates len, a_ptr and b_ptr so the SSE loop only sees the remaining aligned data, as in the sketch below.
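
A minimal sketch of that replacement (untested; note that it only helps
when the two arrays share the same misalignment, otherwise the movaps on
b_ptr still faults):

    // Replaces the two alignment asserts: consume leading floats
    // with scalar code until a_ptr is 16-byte aligned.
    typeof(a[0]) tot = 0.0;
    size_t done = 0;
    while (done < a.length && (cast(size_t)(a_ptr + done) & 0b1111)) {
        tot += a_ptr[done] * b_ptr[done];
        done++;
    }
    a_ptr += done;
    b_ptr += done;
    size_t len = ((a.length - done) & (~UNROLL_MASK)) * a[0].sizeof;
    // The SSE part must then add into tot instead of overwriting it,
    // and the trailing scalar loop must use (a.length - done) too.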


float dot(float[] a, float[] b) {
    enum size_t UNROLL_MASK = 0b111;
    assert(a.length == b.length, "dot(): the two array lengths differ.");

    typeof(a[0]) tot = void;
    auto a_ptr = a.ptr;
    auto b_ptr = b.ptr;

    assert((cast(size_t)a_ptr & cast(size_t)0b1111) == 0,
           "dot(): the first array is not aligned to 16 bytes.");
    assert((cast(size_t)b_ptr & cast(size_t)0b1111) == 0,
           "dot(): the second array is not aligned to 16 bytes.");

    // Number of bytes handled by the SSE loop (a multiple of 8 floats).
    size_t len = (a.length & (~UNROLL_MASK)) * a[0].sizeof;

    if (len) {
        asm {
            mov EAX, a_ptr;
            mov ECX, b_ptr;
            mov EDI, len;     // bytes to process in the unrolled loop
            xorps XMM0, XMM0; // first accumulator, 4 partial sums
            xorps XMM3, XMM3; // second accumulator, 4 partial sums
            xor EDX, EDX;     // byte offset into both arrays

            align 8;
          LOOP_START:
            movaps XMM1, [EAX + EDX];      // 4 floats of a
            mulps XMM1,  [ECX + EDX];      // times 4 floats of b
            movaps XMM2, [EAX + EDX + 16]; // next 4 floats of a
            mulps XMM2,  [ECX + EDX + 16]; // times next 4 floats of b
            add EDX, 32;
            addps XMM0, XMM1;
            cmp EDI, EDX;
            addps XMM3, XMM2;
            jne LOOP_START;

            addps XMM0, XMM3; // merge the two accumulators

            // XMM0[0] = XMM0[0] + XMM0[1] + XMM0[2] + XMM0[3]
            movhlps XMM1, XMM0;
            addps XMM0, XMM1;
            pshufd XMM1, XMM0, 0b01_01_01_01;
            addss XMM0, XMM1;

            movss tot, XMM0;
        }
    } else
        tot = 0.0;

    // Scalar loop for the last (a.length % 8) items.
    size_t left = a.length & UNROLL_MASK;
    for (size_t i = left; i > 0; i--)
        tot += a[$ - i] * b[$ - i];
    return tot;
}
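
To check the result, a usage sketch (dotNaive is a hypothetical reference
implementation; DMD's GC returns 16-byte aligned blocks for arrays this
large, and the asserts catch the case where it doesn't):

import std.stdio: writeln;

// Hypothetical reference implementation to check dot() against.
float dotNaive(float[] a, float[] b) {
    float tot = 0.0;
    foreach (i; 0 .. a.length)
        tot += a[i] * b[i];
    return tot;
}

void main() {
    auto a = new float[1_400];
    auto b = new float[1_400];
    foreach (i, ref x; a) x = i % 9;
    foreach (i, ref x; b) x = i % 7;
    writeln(dot(a, b), " ", dotNaive(a, b));
}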


On the same Celeron it takes about 0.88 ticks for each mul+add when the two arrays are 1400 floats long (they fit in the L1 cache; if they get longer and don't fit in L1, it slows down a little). I think better code or better CPUs can go as low as 0.5 ticks, but I and/or my CPU aren't that good yet. Its usage of cache lines is not optimal, I think. I don't think cache prefetching instructions can help this code.
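
If you want to reproduce the timings, a rough sketch (this assumes 32-bit
DMD, where RDTSC leaves the counter in EDX:EAX, the same register pair
used to return a ulong; benchmark() is a hypothetical harness):

import std.stdio: writefln;

ulong rdtsc() {
    asm {
        naked;
        rdtsc;  // counter -> EDX:EAX, the ulong return registers
        ret;
    }
}

void benchmark() {
    enum size_t N = 1_400;       // the arrays fit in the L1 data cache
    enum size_t ITERS = 100_000;
    auto a = new float[N];
    auto b = new float[N];
    a[] = 1.5f;
    b[] = 2.5f;
    float sink = 0.0;  // keeps the calls from being optimized away
    ulong t0 = rdtsc();
    foreach (i; 0 .. ITERS)
        sink += dot(a, b);
    ulong t1 = rdtsc();
    writefln("%s  %s ticks for each mul+add", sink,
             (t1 - t0) / (cast(double)ITERS * N));
}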

Bye,
bearophile

