optimized array operations

Tue Sep 23 02:02:51 PDT 2008

Eugene Pelekhay wrote:
> I'm finished optimized version of array operations, using SSE2 instructions.

Good work. Note that your code will actually work better for floats 
(using SSE) than for doubles (using SSE2).

As far as I can tell, x87 code may actually be faster for the unaligned 
case.

Comparing x87 code

fld mem
fadd mem
fstp mem

with SSE code

movapd/movupd  reg, mem
addpd reg, mem
movapd/movupd mem, reg

On all of the CPUs below, the x87 code takes 3 uops per element, so it 
is 6 uops for two doubles and 12 for four floats. The number of SSE uops 
depends on whether aligned or unaligned loads and stores are used. 
Importantly, the extra uops are mostly for the load & store ports, so 
this is going to translate reasonably well to clock cycles:

SSE uops:
CPU        aligned  unaligned
PentiumM   6        14
Core2      3        14
AMD K8     6        11
AMD K10    4        5

(AMD K7 is the same as K8, except that it doesn't have SSE2.)
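
To make the comparison concrete, here is a minimal C sketch that sets the 
SSE figures from the table against the x87 cost of covering the same 16 
bytes. It assumes the table values are uops per 16-byte load/add/store 
sequence, which is my reading of the numbers above. The names (SSE_UOPS, 
x87_uops, compare_sse_to_x87) are made up for illustration.

```c
#include <stddef.h>

/* SSE uops per 16-byte load/add/store sequence, copied from the table. */
struct sse_uops {
    const char *cpu;
    int aligned;
    int unaligned;
};

const struct sse_uops SSE_UOPS[] = {
    { "PentiumM", 6, 14 },
    { "Core2",    3, 14 },
    { "AMD K8",   6, 11 },
    { "AMD K10",  4,  5 },
};

enum { X87_UOPS_PER_ELEMENT = 3 };  /* fld + fadd + fstp */

/* x87 uops needed for one vector's worth of data:
   elements_per_vector is 2 for doubles, 4 for floats. */
int x87_uops(int elements_per_vector)
{
    return X87_UOPS_PER_ELEMENT * elements_per_vector;
}

/* Returns 1 if SSE needs fewer uops, -1 if x87 does, 0 on a tie. */
int compare_sse_to_x87(int sse_uops, int elements_per_vector)
{
    int x87 = x87_uops(elements_per_vector);
    return (sse_uops < x87) - (sse_uops > x87);
}
```

For example, compare_sse_to_x87(SSE_UOPS[1].unaligned, 2) compares 14 SSE 
uops with 6 x87 uops for unaligned doubles on Core2 and so favours x87. 
This counts uops only; it is a rough proxy, not a cycle-accurate model.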

Practical conclusion: It is probably better to use x87 for the unaligned 
double case on everything except K10. For unaligned floats it's 
marginal; again, SSE is a clear win only on the K10. If the _destination_ 
is aligned, even if the _source_ is not, SSE floats will be better on any 
of the processors.
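
A rough sketch of what that dispatch could look like for floats, written 
in C with SSE intrinsics rather than in the code under discussion. The 
function names are hypothetical. The plain C loop stands in for the x87 
path; note that a compiler targeting x86-64 will normally emit scalar SSE 
for it rather than x87 instructions.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics for packed floats */
#endif

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* dest[i] += src[i] for n floats. Hypothetical routine for illustration. */
void add_assign_floats(float *dest, const float *src, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    if (is_aligned16(dest)) {
        /* Aligned destination: aligned load and store for dest,
           unaligned load for src (which may or may not be aligned). */
        for (; i + 4 <= n; i += 4) {
            __m128 d = _mm_load_ps(dest + i);
            __m128 s = _mm_loadu_ps(src + i);
            _mm_store_ps(dest + i, _mm_add_ps(d, s));
        }
    }
#endif
    /* Scalar path: handles the tail, or the whole array when dest is
       unaligned or SSE is not available. */
    for (; i < n; ++i)
        dest[i] += src[i];
}
```

Only the destination's alignment is tested, since that is the case where 
SSE floats come out ahead on all four processors in the table. Whether the 
unaligned-destination case should also use SSE on a K10 is left out of 
this sketch.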

Theoretical conclusion: Don't assume SSE is always faster!

The balance changes for more complex operations (for simple ones like 
add, you're limited by memory bandwidth, so SSE doesn't help very much).