<div dir="ltr">You should probably watch my talk again ;)<div style>Most of the relevant points come towards the end, where I make the claim "almost everyone who tries to use SIMD will see the same or slower performance, and the reason is they have simply revealed other bottlenecks".</div>

<div style>I also made the point that "only by strictly applying ALL of the points I demonstrated, will you see significant performance improvement".</div><div style><br></div><div style>The problem with your code is that it doesn't do any real work. Each operation depends on the result of the previous operation. The scalar operations have a shorter latency than the SIMD operations, and the scalar operations all execute in parallel anyway.</div>

<div style>This is exactly the pathological worst-case comparison that basically everyone new to SIMD writes, and then wonders why it's slow.</div><div style>I guess I should have demonstrated this point more clearly in my talk. It was very rushed (actually, the script was basically made up on the spot), sorry about that!</div>

<div style><br></div><div style>There's not enough code in those loops. You're basically profiling loop iteration performance and the latency of a float opcode vs a SIMD opcode, not any significant work.</div><div style>

You should see a big difference if you unroll the loop 4-8 times (or more for such a short loop, depending on the CPU).</div><div style>I also made the point that you should always avoid doing SIMD profiling on x86, and certainly not on x64, since it is both the most forgiving architecture (it yields the smallest wins of any arch) and the hardest to predict; the performance difference you see will almost certainly not be the same on someone else's chip.</div>
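<div style>As a minimal sketch of what unrolling buys you, here is a plain C++ array sum (a hypothetical example, not taken from your code). The unrolled version keeps four independent accumulators, so four additions can be in flight at once instead of each one waiting on the last. With real SIMD, each accumulator would be a vector register rather than a float:</div>

```cpp
#include <cstddef>

// One accumulator: every add depends on the previous add, so the loop
// runs at the latency of a single floating-point addition per element.
float sum_simple(const float* data, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += data[i];
    return acc;
}

// Unrolled 4 times with four independent accumulators: the four adds in
// the loop body do not depend on each other and can overlap.
float sum_unrolled4(const float* data, std::size_t n) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += data[i];
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }
    // Combine as a tree, (a0 + a1) + (a2 + a3), to keep the final
    // dependency chain short as well.
    float acc = (a0 + a1) + (a2 + a3);
    // Handle the leftover elements when n is not a multiple of 4.
    for (; i < n; ++i) acc += data[i];
    return acc;
}
```

<div style>Note that the unrolled version adds the elements in a different order, so for general floating-point data the result can differ from the simple loop in the last few bits.</div>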

<div style><br></div><div style>Look again at my points about latency, reducing the overall pipeline length (demonstrated with the addition sequence), and unrolling the loops.</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">

On 30 June 2013 06:34, Jonathan Dunlap <span dir="ltr">&lt;<a href="mailto:jadit2@gmail.com" target="_blank">jadit2@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I did watch Manu's talk a few days ago, which inspired me to start this project. With the updates in <a href="http://dpaste.dzfl.pl/fce2d93b" target="_blank">http://dpaste.dzfl.pl/fce2d93b</a>, I'm still a bit clueless as to why there is almost zero performance difference, considering that it seems like an ideal setup to benefit from SIMD. I feel that if I can't see gains here, I shouldn't bother using SIMD in practice, where sometimes non-ideal operations must be done.<br>


</blockquote></div><br></div>