SIMD implementation of dot-product. Benchmarks
Manu
turkeyman at gmail.com
Sat Aug 17 22:25:48 PDT 2013
movups is not good. It'll be a lot faster (and portable) if you use movaps.
Process looks something like:
* do the first few elements from a[0] as scalar, until a reaches a 16-byte alignment boundary
* load the left vector of b's aligned pair
* loop for each aligned vector in a:
  - load a[n..n+4] aligned
  - load the right vector of b's pair
  - combine left~right and shift left to match elements against a
  - left = right
* perform stragglers as scalar
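The steps above can be sketched in C with SSE intrinsics. This is only a sketch under stated assumptions, not the code from the benchmark: the names dot_aligned and combine are hypothetical, both pointers are assumed to be at least 4-byte aligned, and the switch inside the loop is kept for brevity rather than speed (a tuned version would hoist it out of the loop).

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_shuffle_ps, _mm_move_ss */

/* Hypothetical helper: return elements k..k+3 of the 8-float
   sequence left:right, for k in 1..3. */
static __m128 combine(__m128 left, __m128 right, unsigned k)
{
    switch (k) {
    case 1: {  /* want [l1 l2 l3 r0] */
        __m128 t = _mm_move_ss(left, right);  /* [r0 l1 l2 l3] */
        return _mm_shuffle_ps(t, t, _MM_SHUFFLE(0, 3, 2, 1));
    }
    case 2:    /* want [l2 l3 r0 r1] */
        return _mm_shuffle_ps(left, right, _MM_SHUFFLE(1, 0, 3, 2));
    default: { /* k == 3: want [l3 r0 r1 r2] */
        __m128 l3 = _mm_shuffle_ps(left, left, _MM_SHUFFLE(3, 3, 3, 3));
        __m128 rr = _mm_shuffle_ps(right, right,
                                   _MM_SHUFFLE(2, 1, 0, 3)); /* [r3 r0 r1 r2] */
        return _mm_move_ss(rr, l3);
    }
    }
}

/* Hypothetical name. Assumes a and b are at least 4-byte aligned. */
float dot_aligned(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;

    /* Floats to process as scalar until a hits a 16-byte boundary. */
    size_t head = ((16 - ((uintptr_t)a & 15)) & 15) / sizeof(float);
    /* Offset of b (in floats) inside its aligned block at that point. */
    unsigned k = (unsigned)((((uintptr_t)b + head * sizeof(float)) & 15)
                            / sizeof(float));
    /* Keep the first aligned load of b inside the array. */
    if (k != 0 && head < k)
        head += 4;
    if (head > n)
        head = n;

    for (; i < head; ++i)
        sum += a[i] * b[i];

    __m128 acc = _mm_setzero_ps();
    if (k == 0) {
        /* Both pointers aligned: plain aligned loads. */
        for (; i + 4 <= n; i += 4)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i),
                                             _mm_load_ps(b + i)));
    } else if (i + 8 - k <= n) {
        /* Left half of b's aligned pair. */
        __m128 left = _mm_load_ps(b + i - k);
        /* Loop only while the right aligned block lies inside b. */
        for (; i + 8 - k <= n; i += 4) {
            __m128 right = _mm_load_ps(b + i - k + 4);
            __m128 bv = combine(left, right, k);
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i), bv));
            left = right;
        }
    }

    /* Horizontal sum of the four partial sums. */
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    sum += tmp[0] + tmp[1] + tmp[2] + tmp[3];

    /* Stragglers as scalar. */
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Every vector load in the loop is an aligned one; the misalignment of b relative to a is handled entirely by the shuffles in combine. Because the summation order differs from a plain scalar loop, results can differ in the last bits for general floating-point inputs.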
Your benchmark is probably misleading too, because I suspect you are
passing freshly allocated arrays directly into the function (which are
16-byte aligned).
movups will be significantly slower if the pointers supplied are not
16-byte aligned.
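One way to exercise the unaligned case in a benchmark is to hand the function a pointer that is deliberately offset from a 16-byte boundary. The sketch below assumes a C11 runtime that provides aligned_alloc; alloc_misaligned and is_aligned16 are hypothetical helper names, not part of the benchmark under discussion.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Hypothetical helper: returns room for n floats starting 4 bytes past
   a 16-byte boundary. *base receives the pointer to pass to free(). */
float *alloc_misaligned(size_t n, void **base)
{
    /* aligned_alloc wants the size to be a multiple of the alignment. */
    size_t bytes = (n + 1) * sizeof(float);
    bytes = (bytes + 15) & ~(size_t)15;

    float *p = aligned_alloc(16, bytes);
    *base = p;
    if (p == NULL)
        return NULL;
    return p + 1;  /* one float past the aligned start */
}
```

Timing the same dot-product routine on both the aligned base pointer and the offset pointer shows how much the result depends on alignment on a given chip.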
Also, results vary significantly between chip manufacturers and revisions.
On 18 August 2013 14:55, Ilya Yaroshenko <ilyayaroshenko at gmail.com> wrote:
> On Saturday, 17 August 2013 at 19:38:52 UTC, John Colvin wrote:
>
>> On Saturday, 17 August 2013 at 19:24:52 UTC, Ilya Yaroshenko wrote:
>>
>>> BTW: -march=native automatically implies -mtune=native
>>>>
>>>
>>> Thanks, I'll remove mtune)
>>>
>>
>> It would be really interesting if you could try writing the same code in
>> C, both a scalar version and a version using gcc's vector intrinsics, to
>> allow us to compare performance and identify areas for D to improve.
>>
>
> I am lazy )
>
> I have looked at assembler code:
>
> float, scalar (main loop):
> .L191:
> vmovss xmm1, DWORD PTR [rsi+rax*4]
> vfmadd231ss xmm0, xmm1, DWORD PTR [rcx+rax*4]
> add rax, 1
> cmp rax, rdi
> jne .L191
>
>
> float, vector (main loop):
> .L2448:
> vmovups ymm5, YMMWORD PTR [rax]
> sub rax, -128
> sub r11, -128
> vmovups ymm4, YMMWORD PTR [r11-128]
> vmovups ymm6, YMMWORD PTR [rax-96]
> vmovups ymm7, YMMWORD PTR [r11-96]
> vfmadd231ps ymm3, ymm5, ymm4
> vmovups ymm8, YMMWORD PTR [rax-64]
> vmovups ymm9, YMMWORD PTR [r11-64]
> vfmadd231ps ymm0, ymm6, ymm7
> vmovups ymm10, YMMWORD PTR [rax-32]
> vmovups ymm11, YMMWORD PTR [r11-32]
> cmp rdi, rax
> vfmadd231ps ymm2, ymm8, ymm9
> vfmadd231ps ymm1, ymm10, ymm11
> ja .L2448
>
> float, vector (full):
> https://gist.github.com/9il/6258443
>
>
> It is pretty optimized)
>
>
> ____
> Best regards
>
> Ilya
>
>