optimized array operations

Thu Sep 25 15:41:36 PDT 2008

"Eugene Pelekhay" <pelekhay at gmail.com> wrote in message 
news:gbeaf0$g28$1 at digitalmars.com...
> Don Wrote:
>
>> Eugene Pelekhay wrote:
>> > I've finished an optimized version of array operations, using SSE2 
>> > instructions.
>>
>> Good work. Note that your code will actually work better for floats
>> (using SSE) than with SSE2.
>>
>> As far as I can tell, X87 code may actually be faster for the unaligned
>> case.
>>
>> Comparing x87 code
>>
>> fld mem
>> fadd mem
>> fstp mem
>>
>> with SSE code
>>
>> movapd/movupd  reg, mem
>> addpd reg, mem
>> movapd/movupd mem, reg
>>
>> On all CPUs below the x87 code takes 3 uops, so it is 6 uops for two
>> doubles, 12 for four floats. The number of SSE uops depends on whether
>> aligned or unaligned loads are used. Importantly, the extra uops are
>> mostly for the load & store ports, so this is going to translate
>> reasonably well to clock cycles:
>>
>> CPU        aligned  unaligned
>> PentiumM   6         14
>> Core2      3         14
>> AMD K8     6         11
>> AMD K10    4         5
>>
>> (AMD K7 is the same as K8, except doesn't have SSE2).
>>
>> Practical conclusion: Probably better to use x87 for the unaligned
>> double case, on everything except K10. For unaligned floats, it's
>> marginal, again only a clear win on the K10. If the _destination_ is
>> aligned, even if the _source_ is not, SSE floats will be better on any
>> of the processors.
>>
>> Theoretical conclusion: Don't assume SSE is always faster!
>>
>> The balance changes for more complex operations (for simple ones like
>> add, you're limited by memory bandwidth, so SSE doesn't help very much).
>
> Thanks for the advice, I'll try to improve it. Actually, I haven't used
> assembler for 7 years and my knowledge is a bit outdated.

If you are doing unaligned memory accesses, it's actually faster to do this:

MOVLPS  XMM0,[address]
MOVHPS  XMM0,[address+8]

than it is to do this:

MOVUPS  XMM0,[address]

The reason is that on almost all chips except the very latest, 128-bit SSE
ops are actually split into two 64-bit ops, so the former code works out a
lot faster.
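
For anyone who prefers intrinsics to inline assembler, the two load sequences above can be sketched in C roughly as follows. This is only an illustration: the function names load4_split and load4_unaligned are made up for this example, and a compiler is free to pick different instructions than the ones named in the comments.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: load four floats from a possibly unaligned
   address as two 64-bit halves. Compilers typically emit
   MOVLPS + MOVHPS for this pair of intrinsics. */
static __m128 load4_split(const float *p)
{
    __m128 v = _mm_setzero_ps();
    v = _mm_loadl_pi(v, (const __m64 *)p);        /* low half:  p[0], p[1] */
    v = _mm_loadh_pi(v, (const __m64 *)(p + 2));  /* high half: p[2], p[3] (address+8) */
    return v;
}

/* Hypothetical helper: the same load as a single 128-bit unaligned
   access, which normally compiles to MOVUPS. */
static __m128 load4_unaligned(const float *p)
{
    return _mm_loadu_ps(p);
}
```

Both functions return the same four values, so they can be swapped in a loop and timed on the CPU you actually care about.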

Also, unaligned loads are a whole lot quicker than unaligned stores, 2 or 3
times faster IIRC. So the best method is to bend over backwards to get your
writes aligned.
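
One way to get the writes aligned is to peel off a few scalar elements until the destination pointer hits a 16-byte boundary, then run the SSE loop with unaligned loads and aligned stores. Here is a minimal sketch in C intrinsics for a float add; the function name add_floats is hypothetical, and this is not the code from Eugene's implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical example: dst[i] = a[i] + b[i] for n floats.
   Every SSE store goes to a 16-byte aligned address; the sources
   may have any alignment. If dst is not even 4-byte aligned, the
   prologue simply handles all n elements with scalar code. */
static void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* Scalar prologue: advance until dst + i is 16-byte aligned. */
    while (i < n && ((uintptr_t)(dst + i) & 15u) != 0) {
        dst[i] = a[i] + b[i];
        i++;
    }

    /* Main loop: unaligned loads (MOVUPS), aligned store (MOVAPS). */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_store_ps(dst + i, _mm_add_ps(va, vb));
    }

    /* Scalar epilogue for the last 0..3 elements. */
    for (; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

The loads could also use the MOVLPS/MOVHPS pair instead of MOVUPS; which is faster depends on the CPU, so it is worth measuring.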