summing large arrays - or - performance of tight loops

Mon Apr 23 13:37:37 PDT 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dan wrote on 2007-04-23:
>> > Using a 1024 x 1024 x 64 array, I got:
>> >
>> > P4:    97% (linux32 FC5)
>> > AMD64: 92% (WinXP32)
>> >
>> > So, the array size seems to make some difference, at least on AMD machines.
>> 
>> The results depend strongly on the memory architecture and, to a lesser
>> extent, on the element values. I've put an updated version online that
>> contains results for byte, short, int, long, float and double.
>
> Actually, the size of the data type doesn't matter at all for a properly
> implemented algorithm. As a general rule, you implement a Duff's device
> to align and then use the largest instruction you can fit.  Right now
> the SSE instruction "movaps" is quite effective for copying memory.

That's what I thought too, but while my SSE versions for float and double
didn't have the worst performance, they were by no means the fastest.
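
For illustration only, here is a rough C sketch of the "align first, then
use the widest loads" approach for float. It is not the benchmarked code:
the function name sum_floats and the exact loop structure are assumptions
made for this example. A plain scalar prologue stands in for the Duff's
device, and the SSE path is compiled only where the compiler reports SSE
support.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define HAVE_SSE 1
#endif

/* Hypothetical helper: returns the sum of the n floats starting at p. */
float sum_floats(const float *p, size_t n)
{
    float total = 0.0f;
    size_t i = 0;

    /* Scalar prologue: consume elements until p + i is 16-byte aligned. */
    while (i < n && ((uintptr_t)(p + i) & 15u) != 0) {
        total += p[i++];
    }

#ifdef HAVE_SSE
    {
        /* Main loop: four floats per iteration using aligned loads. */
        __m128 acc = _mm_setzero_ps();
        float lanes[4];

        for (; i + 4 <= n; i += 4) {
            acc = _mm_add_ps(acc, _mm_load_ps(p + i));
        }

        /* Horizontal reduction of the four partial sums. */
        _mm_storeu_ps(lanes, acc);
        total += (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    }
#endif

    /* Scalar epilogue: the remaining 0..3 elements (or all of them
       when the SSE path is not compiled in). */
    for (; i < n; ++i) {
        total += p[i];
    }
    return total;
}
```

Note that the vector loop adds the elements in a different order than a
plain scalar loop, so float results can differ in the last bits. For
arrays much larger than the cache, a loop like this is likely to be limited
by memory bandwidth rather than by the add instructions.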

Thomas

-----BEGIN PGP SIGNATURE-----

iD8DBQFGLRkOLK5blCcjpWoRAkhDAKCb4IU0RG6HTzL1DywM4yClWwK9eACfYshW
vdYZYS3eIKhBLclsDOyq19M=
=n+9d
-----END PGP SIGNATURE-----