foreach - premature optimization vs cultivating good habits

via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Sat Jan 31 13:53:03 PST 2015


On Friday, 30 January 2015 at 14:41:11 UTC, Laeeth Isharc wrote:
> Thanks, Adam.  That's what I had thought (your first 
> paragraph), but something Ola on a different thread confused me 
> and made me think I didn't understand it, and I wanted to pin 
> it down.

There are always significant optimization effects in 
long-running loops:
- SIMD
- cache locality / prefetching

For the former (SIMD) you need to make sure that good code is 
generated, either by writing it by hand, by using vectorized 
libraries, or by relying on auto-vectorization.
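As a sketch (in C for illustration, since the point is not 
D-specific): auto-vectorization tends to kick in when the loop 
body is simple, the arrays are contiguous, and the compiler can 
prove there is no aliasing. Something like:

```c
#include <stddef.h>

/* A loop written so the compiler can auto-vectorize it:
   contiguous arrays, no aliasing (restrict), a plain trip count.
   Compile with e.g. -O3 -march=native and inspect the assembly
   (or GCC's -fopt-info-vec) to confirm SIMD code was generated. */
void axpy(size_t n, double a, const double *restrict x,
          double *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Whether the compiler actually emits SIMD instructions for this 
depends on the target and flags, so checking the generated code 
is part of the job.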

For the latter (cache) you need to make sure that the hardware 
prefetcher can predict the access pattern or that you prefetch 
explicitly, and also that the working set is small enough to 
stay in the faster cache levels.
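Explicit prefetching can be sketched with a compiler builtin 
(again in C; `__builtin_prefetch` is a GCC/Clang extension, and 
the prefetch distance of 16 elements here is a made-up value 
that would have to be tuned per machine):

```c
#include <stddef.h>

/* Sketch: issue a software prefetch well ahead of the actual
   load, so the data is already in cache when the loop reaches
   it. The distance (16 elements ahead) is an assumption. */
double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */,
                               3 /* high temporal locality */);
        s += a[i];
    }
    return s;
}
```

For a purely sequential scan like this the hardware prefetcher 
usually does fine on its own; explicit prefetching pays off 
mainly for irregular but predictable access patterns.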

If you want good performance you cannot ignore either of these; 
you have to design the data structures and algorithms for them. 
Prefetching has to happen maybe 100 instructions before the 
actual load from memory, and AVX wants aligned data (32-byte 
alignment) and a memory layout that fits the algorithm. On the 
next-gen Skylake Xeons I think the alignment goes up to 64 
bytes, and you get 512-bit-wide registers (so you can do eight 
64-bit floating point operations in parallel per core). The 
difference between issuing 1-4 ops and issuing 8-16 per time 
unit is noticeable...
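Getting that alignment is an allocation-time decision. A minimal 
sketch in C, using C11's `aligned_alloc` (the 64-byte constant 
matches both the cache-line size and 512-bit registers, per the 
assumption above):

```c
#include <stdlib.h>
#include <stddef.h>

/* Sketch: allocate doubles 64-byte aligned so loads line up with
   cache lines and with 512-bit (64-byte) vector registers.
   aligned_alloc requires the size to be a multiple of the
   alignment, so round it up first. */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + 63) & ~(size_t)63;
    return aligned_alloc(64, rounded);
}
```

The returned pointer is freed with plain `free`; languages with 
their own allocators (including D) have equivalent aligned 
allocation facilities.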

And of course, the closer your code is to the CPU's theoretical 
throughput, the more critical it becomes to not wait for memory 
loads.

This is also a moving target...
