Is 2X faster large memcpy interesting?

Don nospam at nospam.com
Thu Mar 26 13:08:40 PDT 2009


The next D2 runtime will include my cache-size detection code. This 
makes it possible to write a cache-aware memcpy, using (for example) 
non-temporal writes when the arrays being copied exceed the size of the 
largest cache.
In my tests, it gives a speed-up of approximately 2X in such cases.
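To make the idea concrete, here's a rough sketch of the technique (not the 
actual runtime code) in D with DMD-style inline asm. The largestCache 
threshold is just a placeholder for whatever the cache-detection code 
reports, and the streaming loop assumes 32-bit x86 with 8-byte-aligned 
pointers and a length that's a multiple of 8 -- a real memcpy would also 
handle the unaligned head and tail:

// Sketch only: pick between a normal copy and a streaming copy based on
// the size of the largest cache.
void cacheAwareCopy(void* dst, const(void)* src, size_t nbytes,
                    size_t largestCache)
{
    if (nbytes <= largestCache)
    {
        // Fits in cache: an ordinary copy is already about as fast as it gets.
        (cast(ubyte*) dst)[0 .. nbytes] = (cast(const(ubyte)*) src)[0 .. nbytes];
        return;
    }

    // Bigger than the largest cache: stream the destination with
    // non-temporal stores, so the writes go straight to memory instead
    // of evicting everything else the program still needs.
    asm
    {
        mov EAX, src;
        mov EDX, dst;
        mov ECX, nbytes;
        shr ECX, 3;            // number of 8-byte blocks
    L1:
        movq MM0, [EAX];       // ordinary cached load from the source
        movntq [EDX], MM0;     // non-temporal store, bypasses the cache
        add EAX, 8;
        add EDX, 8;
        dec ECX;
        jnz L1;
        sfence;                // drain the write-combining buffers
        emms;                  // clear MMX state before returning
    }
}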
The downside is that it's a fair bit of work to implement, and it only 
helps extremely large arrays, so I fear it's basically irrelevant (it 
probably won't help arrays smaller than 32K). Do people actually copy 
megabyte-sized arrays?
Is it worth spending any more time on it?


BTW: I tested the memcpy() code provided in AMD's 1992 optimisation 
manual, and in Intel's 2007 manual. Only one of them actually gave any 
benefit when run on a 2008 Intel Core2 -- which was it? (Hint: it wasn't 
Intel!)
I've noticed that AMD's docs are usually greatly superior to Intel's, but 
this time the difference is unbelievable.


