summing large arrays - or - performance of tight loops

Mon Apr 23 11:55:39 PDT 2007

> > Using a 1024 x 1024 x 64 array, I got:
> >
> > P4:    97% (linux32 FC5)
> > AMD64: 92% (WinXP32)
> >
> > So, the array size seems to make some difference, at least on AMD machines.
> 
> The results strongly depend on the memory architecture and to a lesser
> extent on the element values. I've put an updated version online that
> contains results for byte, short, int, long, float and double.

Actually, the size of the data type doesn't matter at all for a properly implemented algorithm. As a general rule, you implement a Duff's device to get to an aligned address and then use the widest instruction you can fit.  Right now the SSE instruction "movaps" (an aligned 16-byte move) is quite effective for copying memory.
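
To make the "align, then go wide" idea concrete, here is a minimal sketch in portable C. It is only an illustration under assumptions: the name copy_aligned_wide is made up, it uses plain loops rather than an actual Duff's device, and it moves 8-byte words instead of 16-byte movaps loads so that it builds on any platform. A real memcpy is considerably more elaborate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from any library): copy n bytes from src to dst.
 * The two regions must not overlap. */
void *copy_aligned_wide(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    /* Head: single-byte copies until the destination is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)d % sizeof(uint64_t)) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: one 8-byte word per iteration. Going through memcpy with a
     * constant size sidesteps aliasing and source-alignment problems;
     * compilers typically reduce each call to a single load or store. */
    while (n >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        d += sizeof w;
        s += sizeof w;
        n -= sizeof w;
    }

    /* Tail: whatever is left over, one byte at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Note that the element type never appears: the loop only sees bytes and words, which is why the size of the data type stops mattering once the wide path is in place.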

Also, each operating system implements page tracking differently.  Some do it by storing metadata on the page itself.  For that reason, 4 KB arrays can perform really well on some OSes and very poorly on others: they take two pages when you think they're taking one, so the extra cache and page misses hurt you.

For that reason, it can (rarely, but occasionally) actually improve performance to *not* use a power-of-two array size, but something just short of it.
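
The arithmetic behind this can be sketched in a few lines of C. The 16-byte header used in the example numbers is purely an assumption for illustration; the real per-block overhead, if there is any, depends on the OS and the allocator. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of pages touched by a block whose bookkeeping header shares the
 * page with the payload: ceil((header + payload) / page). */
size_t pages_spanned(size_t payload, size_t header, size_t page)
{
    assert(page > 0);
    return (header + payload + page - 1) / page;
}

/* Largest payload that still fits in one page together with its header. */
size_t max_single_page_payload(size_t header, size_t page)
{
    return header < page ? page - header : 0;
}
```

With a 4096-byte page and the assumed 16-byte header, pages_spanned(4096, 16, 4096) gives 2, while pages_spanned(4080, 16, 4096) gives 1. That is the "just short of a power of two" case.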

Sincerely,
Dan