SIMD implementation of dot-product. Benchmarks
Ilya Yaroshenko
ilyayaroshenko at gmail.com
Sat Aug 24 08:01:38 PDT 2013
On Sunday, 18 August 2013 at 05:26:00 UTC, Manu wrote:
> movups is not good. It'll be a lot faster (and portable) if you use movaps.
>
> Process looks something like:
> * do the first few from a[0] until a's alignment interval as scalar
> * load the left of b's aligned pair
> * loop for each aligned vector in a
>   - load a[n..n+4] aligned
>   - load the right of b's pair
>   - combine left~right and shift left to match elements against a
>   - left = right
> * perform stragglers as scalar
>
> Your benchmark is probably misleading too, because I suspect you are
> passing directly alloc-ed arrays into the function (which are 16-byte
> aligned). movups will be significantly slower if the pointers supplied
> are not 16-byte aligned.
> Also, results vary significantly between chip manufacturers and
> revisions.
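A rough sketch of the quoted process in C with SSE intrinsics might look like the following. To keep it short it omits the left/right combine-and-shift trick for b and just uses an unaligned load (movups) there, so only a's loads are guaranteed aligned; dot_sse is an illustrative name, not code from the thread:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_loadu_ps, ... */

float dot_sse(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;

    /* 1. do the first few from a[0] as scalar, until a+i is
       16-byte aligned */
    while (i < n && ((uintptr_t)(a + i) & 15) != 0) {
        sum += a[i] * b[i];
        ++i;
    }

    /* 2. main loop: movaps for a, movups for b (the full trick
       would replace this with two aligned loads of b plus a
       shift to line the elements up against a) */
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_load_ps(a + i);   /* aligned load   */
        __m128 vb = _mm_loadu_ps(b + i);  /* unaligned load */
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    sum += tmp[0] + tmp[1] + tmp[2] + tmp[3];

    /* 3. perform stragglers as scalar */
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```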
I have tried to write a fast implementation with aligned loads:
1. I have no idea how to shift (rotate) a 32-byte AVX vector
without the XOP instruction set (XOP is available only on AMD).
2. I have tried one vmovaps plus [one vmovups]/[two
vinsertf128] with 16-byte aligned arrays (iterating relative to
a, as before). It works slower than two vmovups (because of the
extra loop tricks).
3. Now I have 300 lines of slow dotProduct code =)
4. The condition for small arrays works well.
I think it is better to use:
1. vmovups, if it is available, with a condition for small arrays
2. a version like the one from Phobos if vmovups is not available
3. a special version for small static-size arrays
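That dispatch could be sketched roughly as below, in C for illustration. The names, the cutoff value, and the placeholder kernel body are all assumptions, not code from the thread; a real build would supply the vmovups-based loop in dot_vmovups:

```c
#include <assert.h>
#include <stddef.h>

/* portable scalar fallback (in the spirit of the Phobos version) */
static float dot_scalar(const float *a, const float *b, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

/* stand-in for the unaligned-load SIMD kernel; placeholder body */
static float dot_vmovups(const float *a, const float *b, size_t n)
{
    return dot_scalar(a, b, n);
}

enum { SMALL_CUTOFF = 16 };  /* hypothetical threshold */

float dot(const float *a, const float *b, size_t n)
{
    if (n < SMALL_CUTOFF)         /* 1. condition for small arrays */
        return dot_scalar(a, b, n);
#if defined(__SSE__)
    return dot_vmovups(a, b, n);  /* 2. vmovups is available */
#else
    return dot_scalar(a, b, n);   /* 3. portable fallback */
#endif
}
```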
I think a version for static-size arrays can easily be done for
Phobos; processors can unroll such code. And a dot product
optimized for complex numbers could be done too.
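A static-size version along these lines might look like the sketch below, in C for illustration (dot4 and cdot4 are hypothetical names). With the trip count fixed at compile time the compiler can fully unroll and vectorize the loop. Note the complex variant here conjugates b, which is one common convention for a complex dot product but not the only one:

```c
#include <assert.h>
#include <complex.h>

/* length fixed at compile time: the loop has a known trip
   count, so the compiler can fully unroll and vectorize it */
static inline float dot4(const float a[4], const float b[4])
{
    float s = 0.0f;
    for (int k = 0; k < 4; ++k)
        s += a[k] * b[k];
    return s;
}

/* complex variant; this sketch conjugates b (one common
   convention), giving sum(a[k] * conj(b[k])) */
static inline float complex cdot4(const float complex a[4],
                                  const float complex b[4])
{
    float complex s = 0.0f;
    for (int k = 0; k < 4; ++k)
        s += a[k] * conjf(b[k]);
    return s;
}
```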
Best regards
Ilya
More information about the Digitalmars-d-announce mailing list