<div dir="ltr"><div>movups is not good. It'll be a lot faster (and more portable) if you use movaps.<br></div><div><br></div><div>The process looks something like:</div><div>  * do the first few elements from a[0] up to a's first 16-byte alignment boundary as scalar</div>

<div>  * load the left of b's aligned pair</div><div>  * loop for each aligned vector in a<br></div><div>    - load a[n..n+4] aligned</div><div>    - load the right of b's pair</div><div>    - combine left~right and shift left to match elements against a</div>

<div>    - left = right<br></div><div>  * perform stragglers as scalar</div><div><br></div><div>Your benchmark is probably misleading too, because I suspect you are passing freshly allocated arrays directly into the function (which are 16-byte aligned).</div>
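<div><br></div><div>As a rough sketch of the process above in C with SSE2 intrinsics, assuming the operation is a float dot product (as in the quoted loops), x86 hardware, and float arrays that are at least 4-byte aligned. The names dot_aligned, COMBINE and DOT_LOOP are made up for illustration, and the pair of byte shifts stands in for a single palignr where SSSE3 is available. It also does one extra scalar group of four when b's first aligned block would begin before b[0], so every load stays inside the arrays:</div>

```c
#include <emmintrin.h> /* SSE2: _mm_load_ps (movaps) and 128-bit byte shifts */
#include <stddef.h>
#include <stdint.h>

/* [left, right] is an aligned pair of vectors from b. Drop the first K floats
   of the pair and keep the next four. K must be a literal 1..3 because the
   byte shifts take immediate operands. */
#define COMBINE(left, right, K)                                   \
    _mm_castsi128_ps(_mm_or_si128(                                \
        _mm_srli_si128(_mm_castps_si128(left), 4 * (K)),          \
        _mm_slli_si128(_mm_castps_si128(right), 16 - 4 * (K))))

/* Main loop for a fixed misalignment K of b relative to a.
   Uses a, b, n, i and acc from the enclosing function. */
#define DOT_LOOP(K)                                                     \
    if (i + 8 - (K) <= n) {                                             \
        __m128 left = _mm_load_ps(b + i - (K)); /* left of b's pair */  \
        for (; i + 8 - (K) <= n; i += 4) {                              \
            __m128 right = _mm_load_ps(b + i - (K) + 4);                \
            __m128 vb = COMBINE(left, right, K); /* b[i..i+4] */        \
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i), vb));  \
            left = right;                                               \
        }                                                               \
    }

float dot_aligned(const float *a, const float *b, size_t n)
{
    size_t i = 0;
    float sum = 0.0f;

    /* Scalar head: advance until a + i is 16-byte aligned. */
    for (; i < n && ((uintptr_t)(a + i) & 15) != 0; ++i)
        sum += a[i] * b[i];

    /* k = number of floats by which b + i sits past a 16-byte boundary. */
    size_t k = ((uintptr_t)(b + i) & 15) / sizeof(float);

    /* If b's first aligned block would start before b[0], do one more
       group of four as scalar; a + i stays aligned and k is unchanged. */
    if (k > i)
        for (size_t end = i + 4; i < end && i < n; ++i)
            sum += a[i] * b[i];

    __m128 acc = _mm_setzero_ps();
    switch (k) {
    case 0: /* b is aligned too: plain aligned loads on both sides */
        for (; i + 4 <= n; i += 4)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i),
                                             _mm_load_ps(b + i)));
        break;
    case 1: DOT_LOOP(1); break;
    case 2: DOT_LOOP(2); break;
    default: DOT_LOOP(3); break;
    }

    /* Horizontal sum of the four partial sums. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    sum += (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    /* Stragglers as scalar. */
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

<div>Calling it with deliberately offset pointers, e.g. dot_aligned(a + 1, b + 2, n), exercises the misaligned paths, which is the case a benchmark should cover as well.</div>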

<div>movups will be significantly slower if the pointers supplied are not 16-byte aligned.</div><div>Also, results vary significantly between chip manufacturers and revisions.</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">

On 18 August 2013 14:55, Ilya Yaroshenko <span dir="ltr">&lt;<a href="mailto:ilyayaroshenko@gmail.com" target="_blank">ilyayaroshenko@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="HOEnZb"><div class="h5">On Saturday, 17 August 2013 at 19:38:52 UTC, John Colvin wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

On Saturday, 17 August 2013 at 19:24:52 UTC, Ilya Yaroshenko wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

BTW: -march=native automatically implies -mtune=native<br>

</blockquote>

<br>

Thanks, I'll remove mtune)<br>

</blockquote>

<br>

It would be really interesting if you could try writing the same code in C, both a scalar version and a version using gcc's vector intrinsics, to allow us to compare performance and identify areas for D to improve.<br>


</blockquote>

<br></div></div>

I am lazy )<br>

<br>

I have looked at the assembler code:<br>

<br>

float, scalar (main loop):<br>

.L191:<br>

        vmovss  xmm1, DWORD PTR [rsi+rax*4]<br>

        vfmadd231ss     xmm0, xmm1, DWORD PTR [rcx+rax*4]<br>

        add     rax, 1<br>

        cmp     rax, rdi<br>

        jne     .L191<br>

<br>

<br>

float, vector (main loop):<br>

.L2448:<br>

        vmovups ymm5, YMMWORD PTR [rax]<br>

        sub     rax, -128<br>

        sub     r11, -128<br>

        vmovups ymm4, YMMWORD PTR [r11-128]<br>

        vmovups ymm6, YMMWORD PTR [rax-96]<br>

        vmovups ymm7, YMMWORD PTR [r11-96]<br>

        vfmadd231ps     ymm3, ymm5, ymm4<br>

        vmovups ymm8, YMMWORD PTR [rax-64]<br>

        vmovups ymm9, YMMWORD PTR [r11-64]<br>

        vfmadd231ps     ymm0, ymm6, ymm7<br>

        vmovups ymm10, YMMWORD PTR [rax-32]<br>

        vmovups ymm11, YMMWORD PTR [r11-32]<br>

        cmp     rdi, rax<br>

        vfmadd231ps     ymm2, ymm8, ymm9<br>

        vfmadd231ps     ymm1, ymm10, ymm11<br>

        ja      .L2448<br>

<br>

float, vector (full):<br>

        <a href="https://gist.github.com/9il/6258443" target="_blank">https://gist.github.com/9il/6258443</a><br>

<br>

<br>

It is pretty optimized)<br>

<br>

<br>

____<br>

Best regards<span class="HOEnZb"><font color="#888888"><br>

<br>

Ilya<br>

<br>

</font></span></blockquote></div><br></div>