optimized array operations
Eugene Pelekhay
pelekhay at gmail.com
Wed Sep 24 14:12:00 PDT 2008
Don Wrote:
> Eugene Pelekhay wrote:
> > I've finished an optimized version of the array operations, using SSE2 instructions.
>
> Good work. Note that your code will actually work better for floats
> (using SSE) than for doubles (using SSE2).
>
> As far as I can tell, X87 code may actually be faster for the unaligned
> case.
>
> Comparing x87 code
>
> fld mem
> fadd mem
> fstp mem
>
> with SSE code
>
> movapd/movupd reg, mem
> addpd reg, mem
> movapd/movupd mem, reg
>
> On all CPUs below, the x87 code takes 3 uops, so it is 6 uops for two
> doubles, 12 for four floats. The number of SSE uops depends on whether
> aligned or unaligned loads are used. Importantly, the extra uops are
> mostly for the load & store ports, so this is going to translate
> reasonably well to clock cycles:
>
> CPU        aligned  unaligned
> PentiumM      6        14
> Core2         3        14
> AMD K8        6        11
> AMD K10       4         5
>
> (AMD K7 is the same as K8, except that it doesn't have SSE2).
>
> Practical conclusion: Probably better to use x87 for the unaligned
> double case, on everything except K10. For unaligned floats, it's
> marginal, again only a clear win on the K10. If the _destination_ is
> aligned, even if the _source_ is not, SSE floats will be better on any
> of the processors.
>
> Theoretical conclusion: Don't assume SSE is always faster!
>
> The balance changes for more complex operations (for simple ones like
> add, you're limited by memory bandwidth, so SSE doesn't help very much).
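
The uop comparison quoted above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration. It assumes that the x87 sequence costs 3 uops per element and that one SSE load/add/store sequence handles one 16-byte vector (two doubles or four floats) at the uop counts from the table. The struct and function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* SSE uop counts for one load/add/store sequence, copied from the table above. */
struct cpu_uops {
    const char *name;
    int sse_aligned;
    int sse_unaligned;
};

static const struct cpu_uops cpus[] = {
    { "PentiumM", 6, 14 },
    { "Core2",    3, 14 },
    { "AMD K8",   6, 11 },
    { "AMD K10",  4,  5 },
};

enum { X87_UOPS_PER_ELEMENT = 3 };  /* fld + fadd + fstp */

/* x87 uops needed to process as many elements as one 16-byte SSE vector holds. */
static int x87_uops_per_vector(size_t element_size)
{
    assert(element_size == 4 || element_size == 8);
    return X87_UOPS_PER_ELEMENT * (int)(16 / element_size);
}

/* Positive result: SSE needs fewer uops. Negative result: x87 needs fewer. */
static int sse_advantage(const struct cpu_uops *cpu, size_t element_size, int aligned)
{
    int sse = aligned ? cpu->sse_aligned : cpu->sse_unaligned;
    return x87_uops_per_vector(element_size) - sse;
}
```

With these numbers, sse_advantage for unaligned doubles is negative on every CPU except the K10. For unaligned floats it ranges from -2 to +1 on the first three CPUs, which matches the "marginal" verdict.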
Thanks for the advice, I'll try to improve it. Actually, I haven't used assembler for 7 years, so my knowledge is a bit outdated.
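
One way to act on the advice above is to take the SSE2 path only when all three pointers are 16-byte aligned, and to fall back to a plain scalar loop otherwise. This is a sketch in C with compiler intrinsics, not the original D/assembler code. `add_doubles` and `is_aligned16` are hypothetical names, and whether the scalar loop compiles to x87 or to scalar SSE instructions depends on the compiler and target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* True if p sits on a 16-byte boundary, as movapd requires. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* dst[i] = a[i] + b[i] for i in [0, n). */
static void add_doubles(double *dst, const double *a, const double *b, size_t n)
{
    size_t i = 0;
    assert(n == 0 || (dst && a && b));
#if defined(__SSE2__)
    /* Aligned case: two doubles per iteration with aligned loads and stores. */
    if (is_aligned16(dst) && is_aligned16(a) && is_aligned16(b)) {
        for (; i + 2 <= n; i += 2) {
            __m128d va = _mm_load_pd(a + i);
            __m128d vb = _mm_load_pd(b + i);
            _mm_store_pd(dst + i, _mm_add_pd(va, vb));
        }
    }
#endif
    /* Scalar loop: finishes the tail, or does all the work if any pointer is unaligned. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

On a K10, where unaligned SSE is cheap according to the table, the unaligned case could instead use `_mm_loadu_pd` and `_mm_storeu_pd`. Choosing between the two at run time would need a CPU check that is not shown here.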