optimized array operations
Don
nospam at nospam.com.au
Tue Sep 23 02:02:51 PDT 2008
Eugene Pelekhay wrote:
> I'm finished optimized version of array operations, using SSE2 instructions.
Good work. Note that your code will actually work better for floats
(using SSE) than with SSE2.
As far as I can tell, X87 code may actually be faster for the unaligned
case.
Comparing x87 code

    fld  mem
    fadd mem
    fstp mem

with SSE code

    movapd/movupd reg, mem
    addpd         reg, mem
    movapd/movupd mem, reg

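For anyone who prefers C to assembly, the SSE sequence can be sketched with compiler intrinsics. This is only an illustration and not Eugene's code: the function name `add_doubles` and the three-array form `dst = a + b` are assumptions made for the example. The non-SSE fallback is plain scalar code, which a compiler turns into x87 instructions only on a 32-bit build without SSE math.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics for packed doubles */
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n doubles.
   No pointer needs to be 16-byte aligned. */
void add_doubles(double *dst, const double *a, const double *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 2 <= n; i += 2) {
        __m128d x = _mm_loadu_pd(a + i);   /* typically movupd reg, mem */
        /* Legacy-SSE addpd with a memory operand requires 16-byte
           alignment, so an unaligned b needs its own movupd load. */
        __m128d y = _mm_loadu_pd(b + i);
        x = _mm_add_pd(x, y);              /* addpd reg, reg */
        _mm_storeu_pd(dst + i, x);         /* typically movupd mem, reg */
    }
#endif
    /* Scalar tail, and the whole loop when SSE2 is unavailable. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Swapping `_mm_loadu_pd`/`_mm_storeu_pd` for `_mm_load_pd`/`_mm_store_pd` gives the aligned (movapd) variant, which faults if a pointer is not 16-byte aligned.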
On all CPUs below, the x87 code takes 3 uops per element, so it is 6 uops
for two doubles and 12 for four floats. The number of SSE uops depends on
whether aligned or unaligned loads are used. Importantly, the extra uops
are mostly for the load & store ports, so this is going to translate
reasonably well to clock cycles:

CPU        aligned  unaligned
PentiumM      6        14
Core2         3        14
AMD K8        6        11
AMD K10       4         5

(AMD K7 is the same as K8, except that it doesn't have SSE2.)
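The comparison can be checked with a few lines of C. This is a rough sketch under one assumption: that each table entry counts the uops for one load/add/store SSE sequence, i.e. one 16-byte vector holding 2 doubles or 4 floats. The struct and function names are made up for the example; the numbers are copied from the table above.

```c
#include <assert.h>
#include <stddef.h>

/* SSE uops per 16-byte load/add/store sequence, from the table above. */
struct sse_uops {
    const char *cpu;
    int aligned;
    int unaligned;
};

static const struct sse_uops SSE_TABLE[] = {
    {"PentiumM", 6, 14},
    {"Core2",    3, 14},
    {"AMD K8",   6, 11},
    {"AMD K10",  4,  5},
};

enum { X87_UOPS_PER_ELEMENT = 3 };   /* fld + fadd + fstp */

/* x87 uops for as many elements as fit in one 16-byte vector
   (element_size 8 -> 2 doubles, element_size 4 -> 4 floats). */
int x87_uops_per_vector(size_t element_size)
{
    assert(element_size == 4 || element_size == 8);
    return X87_UOPS_PER_ELEMENT * (int)(16 / element_size);
}

/* Positive result: unaligned SSE needs more uops than x87.
   Negative result: unaligned SSE needs fewer. */
int unaligned_sse_minus_x87(size_t cpu_index, size_t element_size)
{
    assert(cpu_index < sizeof SSE_TABLE / sizeof SSE_TABLE[0]);
    return SSE_TABLE[cpu_index].unaligned - x87_uops_per_vector(element_size);
}
```

For doubles the difference is +8, +8, +5 and -1 uops (PentiumM, Core2, K8, K10); for floats it is +2, +2, -1 and -7. Uop counts are only a proxy for cycles, so the small differences should be read as "marginal" rather than as a prediction.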
Practical conclusion: Probably better to use x87 for the unaligned
double case, on everything except K10. For unaligned floats, it's
marginal, again only a clear win on the K10. If the _destination_ is
aligned, even if the _source_ is not, SSE floats will be better on any
of the processors.
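One way to get an aligned destination is to peel off scalar iterations until the destination pointer reaches a 16-byte boundary, then use aligned stores with unaligned loads. The following is a minimal sketch of that idea, not a tuned implementation: the name `add_floats_aligned_dst` and the `dst = a + b` form are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics for packed floats */
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n floats.
   Sources may be unaligned; stores in the vector loop are aligned. */
void add_floats_aligned_dst(float *dst, const float *a, const float *b,
                            size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Scalar prologue: advance until dst + i is 16-byte aligned. */
    while (i < n && ((uintptr_t)(dst + i) % 16) != 0) {
        dst[i] = a[i] + b[i];
        ++i;
    }
    /* Main loop: unaligned loads (movups), aligned store (movaps). */
    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(a + i);
        __m128 y = _mm_loadu_ps(b + i);
        _mm_store_ps(dst + i, _mm_add_ps(x, y));
    }
#endif
    /* Scalar tail, and the whole loop when SSE is unavailable. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

The prologue runs at most three iterations for a properly aligned `float` pointer, so its cost is negligible for long arrays.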
Theoretical conclusion: Don't assume SSE is always faster!
The balance changes for more complex operations (for simple ones like
add, you're limited by memory bandwidth, so SSE doesn't help very much).