<div class="gmail_quote">On 16 January 2012 19:01, Timon Gehr <span dir="ltr"><<a href="mailto:timon.gehr@gmx.ch">timon.gehr@gmx.ch</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im">On 01/16/2012 05:59 PM, Manu wrote:<br>
</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">
On 16 January 2012 18:48, Andrei Alexandrescu<br></div>
<<a href="mailto:SeeWebsiteForEmail@erdani.org" target="_blank">SeeWebsiteForEmail@erdani.org</a> <mailto:<a href="mailto:SeeWebsiteForEmail@erdani.org" target="_blank">SeeWebsiteForEmail@<u></u>erdani.org</a>>><div>
<div class="h5"><br>
wrote:<br>
<br>
On 1/16/12 10:46 AM, Manu wrote:<br>
<br>
A function using float arrays and a function using hardware vectors<br>
should certainly not be the same speed.<br>
<br>
<br>
My point was that the version using float arrays should<br>
opportunistically use hardware ops whenever possible.<br>
<br>
<br></div></div><div class="im">
I think this is a mistake, because such a piece of code never exists<br>
outside of some context. If the context it exists within is all FPU code<br>
(and it is, it's a float array), then swapping between FPU and SIMD<br>
execution units will probably result in the function being slower than<br>
the original (also the float array is unaligned). The SIMD version<br>
however must exist within a SIMD context; since the API can't implicitly<br>
interact with floats, this guarantees that the context of each function<br>
matches that within which it lives.<br>
This is fundamental to fast vector performance. Using SIMD is an<br>
all-or-nothing decision; you can't just mix it in here and there.<br>
You don't go casting back and forth between floats and ints on every<br>
other line... obviously it's imprecise, but it's also a major<br>
performance hazard. There is no difference here, except the performance<br>
hazard is much worse.<br>
</div></blockquote>
<br>
I think DMD now uses XMM registers for scalar floating point arithmetic on x86_64.<br>
</blockquote></div><br><div>x64 can do the swapping too with no penalty, but it is the only architecture that can. So it might be a viable optimisation, but only for x64 codegen, which means any logic to detect and apply it should live in the back end, not in the front end as a higher-level semantic.</div>