SIMD implementation of dot-product. Benchmarks
Ilya Yaroshenko
ilyayaroshenko at gmail.com
Sat Aug 17 21:55:24 PDT 2013
On Saturday, 17 August 2013 at 19:38:52 UTC, John Colvin wrote:
> On Saturday, 17 August 2013 at 19:24:52 UTC, Ilya Yaroshenko
> wrote:
>>> BTW: -march=native automatically implies -mtune=native
>>
>> Thanks, I'll remove mtune)
>
> It would be really interesting if you could try writing the
> same code in C, both a scalar version and a version using gcc's
> vector intrinsics, to allow us to compare performance and
> identify areas for D to improve.
I am lazy )
I have looked at the generated assembler code instead:
float, scalar (main loop):
.L191:
vmovss xmm1, DWORD PTR [rsi+rax*4]
vfmadd231ss xmm0, xmm1, DWORD PTR [rcx+rax*4]
add rax, 1
cmp rax, rdi
jne .L191
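For comparison with the scalar C version asked about above, here is a minimal sketch of the computation this loop performs. The function name dot_scalar is hypothetical and is not taken from the benchmark code. Whether a C compiler fuses the multiply and add into a single vfmadd231ss depends on the target flags and the floating-point contraction setting, so this only illustrates the arithmetic, not the exact instruction selection.

```c
#include <stddef.h>

/* Hypothetical scalar reference: one multiply-add per element,
   accumulated into a single float, as in the .L191 loop. */
float dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];   /* candidate for a fused multiply-add */
    return sum;
}
```

Because every iteration depends on the previous value of sum, the loop forms a single dependency chain through the accumulator register.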
float, vector (main loop):
.L2448:
vmovups ymm5, YMMWORD PTR [rax]
sub rax, -128
sub r11, -128
vmovups ymm4, YMMWORD PTR [r11-128]
vmovups ymm6, YMMWORD PTR [rax-96]
vmovups ymm7, YMMWORD PTR [r11-96]
vfmadd231ps ymm3, ymm5, ymm4
vmovups ymm8, YMMWORD PTR [rax-64]
vmovups ymm9, YMMWORD PTR [r11-64]
vfmadd231ps ymm0, ymm6, ymm7
vmovups ymm10, YMMWORD PTR [rax-32]
vmovups ymm11, YMMWORD PTR [r11-32]
cmp rdi, rax
vfmadd231ps ymm2, ymm8, ymm9
vfmadd231ps ymm1, ymm10, ymm11
ja .L2448
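The structure of this loop (four independent accumulators ymm0..ymm3, eight floats each, 128 bytes of each input consumed per iteration) can be sketched in C with the GCC/Clang vector extension. This is only an illustration under assumptions: the names v8sf and dot_unrolled are hypothetical, the tail handling and final reduction are my own choice and are not read off the assembly above, and the compiler is free to emit different instructions than the listing.

```c
#include <stddef.h>
#include <string.h>

/* Eight packed floats (32 bytes); vector_size is a GCC/Clang extension. */
typedef float v8sf __attribute__((vector_size(32)));

/* Hypothetical sketch: dot product with four independent vector accumulators. */
float dot_unrolled(const float *a, const float *b, size_t n)
{
    v8sf acc0 = {0}, acc1 = {0}, acc2 = {0}, acc3 = {0};
    size_t i = 0;

    /* Main loop: 32 floats (128 bytes) from each input per iteration. */
    for (; i + 32 <= n; i += 32) {
        v8sf x[4], y[4];
        memcpy(x, a + i, sizeof x);   /* unaligned loads, like vmovups */
        memcpy(y, b + i, sizeof y);
        acc0 += x[0] * y[0];          /* four independent multiply-add chains */
        acc1 += x[1] * y[1];
        acc2 += x[2] * y[2];
        acc3 += x[3] * y[3];
    }

    /* Combine the accumulators, then sum the eight lanes. */
    v8sf acc = (acc0 + acc1) + (acc2 + acc3);
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k)
        sum += acc[k];

    /* Scalar tail for the remaining n % 32 elements. */
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Keeping four separate accumulators means consecutive multiply-adds do not wait on each other, which is the same idea as the four vfmadd231ps instructions writing to different registers in the listing. Note that the summation order differs from the scalar loop, so results can differ in the last bits for general inputs.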
float, vector (full):
https://gist.github.com/9il/6258443
It is pretty well optimized)
____
Best regards
Ilya