Good dotProduct

Tue Jun 29 10:08:48 PDT 2010

Andrei:

>That's pretty cool!<

LLVM compiler can't unroll loops that loop a number of times known at compile time (I have an open LLVM bug report on this), and even when this bug gets fixed, then LLVM isn't able to vectorize yet, so it doesn't use SIMD instructions like movapd, mulpd etc that work with two doubles at the same time (unless you use intrinsics instructions like _mm_load_pd that are currently not available with DMD but can be used with LDC too). So this routine is an improvement even on the asm produced by LDC from D code.

But there are some other things you need to consider:

dot5() works with arrays of doubles only; but it's not too much hard to write a version for floats, if you want I can write it in some time (I am slow in writing asm). A version for reals is probably less important.

dot5() uses the movapd instruction, so it requires the doubles to be aligned to 16 bytes. Currently it just asserts if data is not aligned, but such assert is kind of useless. This problem can be solved in three ways: adding a bit of code (in D or asm) in the beginning to use the first unaligned double if present. This branch slows down code a bit (such slow down can be felt only for small arrays, of course). Otherwise there are other solutions, none of them is perfect. 

Currently in D2 big dynamic arrays are not aligned to 16 bytes, but Steven Schveighoffer will fix it:
http://d.puremagic.com/issues/show_bug.cgi?id=4400

Even when Steven has fixed the alignment to 16 bytes, you just need to slice the array like [1..$] to have an unaligned array. The type system can come to the rescue: an AlignedArray (subtype of array) can be invented that guarantees alignment to 16 bytes even for its slices (so it just refuses to be sliced on odd indexed bounds), then inside the dotProduct() code a static if can test for such array and avoid the runtime test (branch) for 16 byte alignment.

Note that future CPUs can drop the requirement for 16 bytes alignments. If this happens then AlignedArray can become useless.

That dot5() seems to work, but it needs unittests.

The loop of dot5() is unrolled once, and each instruction works on two doubles, so essentially this code is unrolled four times. This means dot5() works only if the length of the input arrays is a multiple of 4. A bit of extra trailing or leading code (in asm or D, but D is probably enough) needs to be added to perform the mul+sum in the few pairs not managed by the main asm loop.

Maybe the D code inside dot5() can be improved a bit.

movapd is a SSE2 instruction, so a SSE2 detection branch can be added, this is very easy, you can test is with the sse2 function from the arraydouble module of the druntime:
private import core.cpuid;
bool sse2() { return cpuid == 3 && core.cpuid.sse2(); }
The small problem is of course that this is a runtime test, this can slow down code a bit when input arrays a small. It's not easy to move this test from runtime to compile-time (to compile a module with different versions of the function... this can require self-modifying code to change the "static" start address of the right function... Some compiler support can improve this situation.

Despite dot5() is almost three times faster than std.numeric.dotProduct for aligned arrays of about 1400 doubles, for small or very small arrays it's slower. So another runtime test (if) can be added inside the function to switch to a different algorithm (probably purely written in D) that works efficiently on very short arrays.

>Do we have your permission to replace dotProduct with yours?<

Of course :-) Improving std.numerics was my purpose from the beginning. But as you see dot5() isn't yet a replacement for dotProduct, some more work needs to be done even if dot() contains no bugs.

Bye,
bearophile