optimized array operations

Wed Sep 24 14:12:00 PDT 2008

Don wrote:

> Eugene Pelekhay wrote:
> > I've finished an optimized version of array operations, using SSE2 instructions.
> 
> Good work. Note that your code will actually work better for floats 
> (using SSE) than for doubles (using SSE2).
> 
> As far as I can tell, X87 code may actually be faster for the unaligned 
> case.
> 
> Comparing x87 code
> 
> fld mem
> fadd mem
> fstp mem
> 
> with SSE code
> 
> movapd/movupd  reg, mem
> addpd reg, mem
> movapd/movupd mem, reg
> 
> On all the CPUs below, the x87 code takes 3 uops per element, so it is 
> 6 uops for two doubles and 12 for four floats. The number of SSE uops 
> depends on whether aligned or unaligned loads are used. Importantly, the 
> extra uops are mostly for the load & store ports, so this is going to 
> translate reasonably well to clock cycles:
> 
> CPU        aligned  unaligned
> PentiumM   6        14
> Core2      3        14
> AMD K8     6        11
> AMD K10    4        5
> 
> (AMD K7 is the same as K8, except that it doesn't have SSE2).
> 
> Practical conclusion: Probably better to use x87 for the unaligned 
> double case, on everything except K10. For unaligned floats, it's 
> marginal, again only a clear win on the K10. If the _destination_ is 
> aligned, even if the _source_ is not, SSE floats will be better on any 
> of the processors.
> 
> Theoretical conclusion: Don't assume SSE is always faster!
> 
> The balance changes for more complex operations (for simple ones like 
> add, you're limited by memory bandwidth, so SSE doesn't help very much).

Thanks for the advice, I'll try to improve it. Actually, I haven't used assembler for 7 years, and my knowledge is a bit outdated.
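
As a rough illustration of the "destination aligned, source possibly not" case from the quoted advice, here is a minimal sketch in C with SSE intrinsics. It is not the code under discussion: the function name add_floats and its signature are hypothetical. It takes the SSE path only when the destination is 16-byte aligned and falls back to a plain scalar loop otherwise. The scalar loop only stands in for the x87 sequence, since the compiler decides which instructions it actually emits, and whether either path wins on a given CPU would need to be measured.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_add_ps, _mm_store_ps */
#endif

/* Hypothetical helper (not from the original post): dst[i] = a[i] + b[i]. */
static void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Use SSE only when the destination is 16-byte aligned, so the store
       can be an aligned one (movaps). The sources may be unaligned, so
       they are read with unaligned loads (movups). */
    if (((uintptr_t)dst & 15u) == 0) {
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_store_ps(dst + i, _mm_add_ps(va, vb));
        }
    }
#endif
    /* Scalar fallback: covers an unaligned destination and any tail
       elements left over after the 4-wide loop. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Calling add_floats(dst, a, b, n) with an aligned dst processes four floats per iteration and finishes the remainder in the scalar loop. With an unaligned dst the whole array goes through the scalar loop.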