optimized array operations
Jb
jb at nowhere.com
Thu Sep 25 15:41:36 PDT 2008
"Eugene Pelekhay" <pelekhay at gmail.com> wrote in message
news:gbeaf0$g28$1 at digitalmars.com...
> Don Wrote:
>
>> Eugene Pelekhay wrote:
>> > I'm finished optimized version of array operations, using SSE2
>> > instructions.
>>
>> Good work. Note that your code will actually work better for floats
>> (using SSE) than with SSE2.
>>
>> As far as I can tell, X87 code may actually be faster for the unaligned
>> case.
>>
>> Comparing x87 code
>>
>> fld mem
>> fadd mem
>> fstp mem
>>
>> with SSE code
>>
>> movapd/movupd reg, mem
>> addpd reg, mem
>> movapd/movupd mem, reg
>>
>> On all CPUs below the x87 code takes 3uops, so it is 6 uops for two
>> doubles, 12 for four floats. The number of SSE uops depends on whether
>> aligned or unaligned loads are used. Importantly, the extra uops are
>> mostly for the load & store ports, so this is going to translate
>> reasonably well to clock cycles:
>>
>> CPU        aligned   unaligned
>> PentiumM   6         14
>> Core2      3         14
>> AMD K8     6         11
>> AMD K10    4         5
>>
>> (AMD K7 is the same as K8, except doesn't have SSE2).
>>
>> Practical conclusion: Probably better to use x87 for the unaligned
>> double case, on everything except K10. For unaligned floats, it's
>> marginal, again only a clear win on the K10. If the _destination_ is
>> aligned, even if the _source_ is not, SSE floats will be better on any
>> of the processors.
>>
>> Theoretical conclusion: Don't assume SSE is always faster!
>>
>> The balance changes for more complex operations (for simple ones like
>> add, you're limited by memory bandwidth, so SSE doesn't help very much).
>
> Thanks for the advice, I'll try to improve it. Actually, I haven't used
> assembler for 7 years and my knowledge is a bit outdated.
If you are doing unaligned memory accesses, it's actually faster to do this:
MOVLPS XMM0,[address]
MOVHPS XMM0,[address+8]
than it is to do
MOVUPS XMM0,[address]
The reason is that on all but the very latest chips, 128-bit SSE ops are
actually split into two 64-bit ops, so the former code works out a lot
faster.
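
For anyone working from C rather than assembler, the two load sequences above can be sketched with SSE intrinsics. This is only an illustration: the function names load4_split and load4_unaligned are made up for this example, and a compiler is free to pick different instructions than the ones named in the comments, so check the generated code before drawing timing conclusions.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: load four floats from a possibly unaligned address
   as two 64-bit halves (intended to map to MOVLPS + MOVHPS). */
static __m128 load4_split(const float *p)
{
    __m128 v = _mm_setzero_ps();
    v = _mm_loadl_pi(v, (const __m64 *)p);        /* low two floats  */
    v = _mm_loadh_pi(v, (const __m64 *)(p + 2));  /* high two floats */
    return v;
}

/* Hypothetical helper: the same load as one 128-bit unaligned access
   (intended to map to MOVUPS). */
static __m128 load4_unaligned(const float *p)
{
    return _mm_loadu_ps(p);
}
```

Both helpers return the same register contents; any difference is in how the load is issued, which is why the result has to be measured on the target CPU.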
Also, unaligned loads are a whole lot quicker than unaligned stores, 2 or 3
times faster IIRC. So the best method is to bend over backwards to get your
writes aligned.
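
One common way to get the writes aligned is to handle a few leading elements with scalar code until the destination reaches a 16-byte boundary, then run the SSE loop with aligned stores and unaligned loads. The sketch below assumes float arrays that do not overlap; the name add_floats_aligned_dst is invented for this example and is not taken from Eugene's code.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n).
   Only the destination is brought to 16-byte alignment; the sources
   are read with unaligned loads. */
static void add_floats_aligned_dst(float *dst, const float *a,
                                   const float *b, size_t n)
{
    size_t i = 0;

    /* Scalar prologue: advance until dst + i is 16-byte aligned. */
    while (i < n && ((uintptr_t)(dst + i) & 15) != 0) {
        dst[i] = a[i] + b[i];
        i++;
    }

    /* Main loop: unaligned loads, aligned stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_store_ps(dst + i, _mm_add_ps(va, vb));
    }

    /* Scalar epilogue for the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

If a source happens to share the destination's alignment, its loads become aligned for free after the prologue; otherwise only the cheaper unaligned loads are paid for.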