optimized array operations

Fri Sep 26 02:26:38 PDT 2008

Eugene Pelekhay wrote:
> Jb Wrote:
> 
>> If you are doing unaligned memory accesses it's actually faster to do this:
>>
>> MOVLPS    XMM0,[address]
>> MOVHPS   XMM0,[address+8]
>>
>> Than it is to do
>>
>> MOVUPS  XMM0,[address]
>>
>> The reason is that on all but the very latest chips, SSE ops are actually 
>> split into two 64-bit ops, so the former code works out a lot faster.
>>
>> Also, unaligned loads are a whole lot quicker than unaligned stores, 2 or 3 
>> times faster IIRC. So the best method is to bend over backwards to get your 
>> writes aligned.
>>
> 
> Thanks, I'll check this way too.
> Meanwhile, can anybody test the new version on other systems? I implemented the unaligned case with x87 instructions, and my benchmark shows that it runs much slower than the SSE2 version. This means that either Don's theory is wrong, or I have an unusual Pentium-M, or my x87 code is bad.
You have way too many indexing operations. Also, unrolling by 8 makes 
the code so big that you probably get limited by instruction decoding.
The whole loop can be reduced to something like (not tested):

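// EAX = length in elements (must be non-zero).
// EDX, ESI = sources, EDI = destination.
// Point each pointer past the end, then count UP from -length to 0.
  lea EDX, [EDX + 8*EAX];
  lea EDI, [EDI + 8*EAX];
  lea ESI, [ESI + 8*EAX];
  neg EAX;
start:
  fld double ptr [EDX+8*EAX];   // 8-byte operand, to match the 8*EAX scale
  fadd double ptr [ESI+8*EAX];
  fstp double ptr [EDI+8*EAX];  // store the sum and pop
  add EAX, 1;                   // sets ZF when the index reaches 0
  jnz start;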
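
For anyone who finds the indexing trick easier to read in C, here is a rough equivalent. It is only a sketch: the function name and signature are made up for illustration, it assumes double elements (which is what the 8*EAX scaling implies), and a compiler is free to generate quite different code from it.

```c
#include <stddef.h>

/* Hypothetical helper, not from the original code:
   dst[i] = a[i] + b[i] for i in [0, n), using a single index
   that counts up from -n to 0. */
static void add_arrays(double *dst, const double *a, const double *b, size_t n)
{
    /* Point each pointer one past the end, like the three LEAs. */
    const double *a_end = a + n;
    const double *b_end = b + n;
    double *dst_end = dst + n;

    /* One index serves all three arrays, and the loop condition is
       just a comparison with zero. */
    for (ptrdiff_t i = -(ptrdiff_t)n; i != 0; ++i)
        dst_end[i] = a_end[i] + b_end[i];
}
```

Calling add_arrays(d, a, b, 4) fills d[0..3] with the element-wise sums. Unlike the asm loop, which assumes a non-zero length, the for loop tests the index before the first iteration, so n == 0 does nothing.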

There are 5 fused uops in the loop. Every instruction is 1 uop, so 
decoding is not a bottleneck.
There are two memory loads per loop (execution unit p2), one store (p3), 
add EAX uses p0 or p1, jnz uses p1, fadd uses p0 or p1. Since Pentium M 
can issue 3 uops per clock as long as they go to different units, and 
the two loads must share p2, the best case would be two clocks per loop.
Loop unrolling _might_ be necessary to get it to schedule the 
instructions correctly, but otherwise it's unhelpful.
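
The two-clock figure can be checked with a back-of-the-envelope calculation. The C below is a deliberately simplified model, not a description of the real scheduler: the function names and parameters are invented for illustration, it takes the port assignments exactly as stated above, and it ignores latencies and any other stalls.

```c
/* Ceiling division for small positive counts. */
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

static int max_int(int a, int b) { return a > b ? a : b; }

/* Hypothetical lower bound on clocks per iteration, assuming:
   - at most issue_width uops issue per clock,
   - all loads share one port (p2),
   - all stores share one port (p3),
   - the remaining uops are spread over two ports (p0 and p1). */
static int min_clocks_per_iteration(int total_uops, int issue_width,
                                    int load_uops, int store_uops,
                                    int p01_uops)
{
    int bound = ceil_div(total_uops, issue_width);  /* issue limit */
    bound = max_int(bound, load_uops);              /* single load port */
    bound = max_int(bound, store_uops);             /* single store port */
    bound = max_int(bound, ceil_div(p01_uops, 2));  /* two ALU/FP ports */
    return bound;
}
```

For this loop, min_clocks_per_iteration(5, 3, 2, 1, 3) gives 2: five uops at three per clock need two clocks, the two loads need two clocks on p2, and add, jnz and fadd need two clocks on p0/p1. Every limit lands on the same number, so unrolling has nothing to win back.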

On Pentium M there's a bug which means it keeps trying to do two fadds at 
once, even though it has only one FADD execution unit. One of them keeps 
getting stalled, so the loop probably won't be as fast as it should be. 
Sometimes you can fix that by moving the add EAX above the store, or 
above the fadd (subtracting 8 from the displacement of any memory operand 
that then comes after the add). On Core2 you should get 2 clocks per 
iteration without any trouble.