Is 2X faster large memcpy interesting?

Don nospam at nospam.com
Fri Mar 27 02:44:21 PDT 2009


Andrei Alexandrescu wrote:
> Don wrote:
>> The next D2 runtime will include my cache-size detection code. This 
>> makes it possible to write a cache-aware memcpy, using (for example) 
>> non-temporal writes when the arrays being copied exceed the size of 
>> the largest cache.
>> In my tests, it gives a speed-up of approximately 2X in such cases.
>> The downside is, it's a fair bit of work to implement, and it only 
>> affects extremely large arrays, so I fear it's basically irrelevant 
>> (It probably won't help arrays < 32K in size). Do people actually copy 
>> megabyte-sized arrays?
>> Is it worth spending any more time on it?
> 
> I'd think so. In this day and age it is appalling that we don't quite 
> know how to quickly copy memory around. A long time ago I ran some 
> measurements (http://www.ddj.com/cpp/184403799) and I was quite 
> surprised. My musings were as true then as now. And now we're getting to 
> the second freakin' Space Odyssey!
> 
> ===============
> Things are clearly hazy, aren't they? First off, maybe it came as a 
> surprise to you that there's more than one way to fill and copy objects. 
> Then, there's no single variant of fill and copy that works best on all 
> compilers, data sets, and machines. (I guess if I tested the same code 
> on a Celeron, which has less cache, I would have gotten very different 
> results. To say nothing about other architectures.)

More precisely, the optimal algorithm depends on the size of the memory 
being copied and on the sizes of the CPU caches. However, the only thing 
that really varies between machines is where the thresholds lie.
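
As a very rough sketch (not the actual druntime code), the dispatch 
amounts to something like the following; largestCacheSize is just a 
placeholder for whatever the cache-detection code reports, and the 
streaming copy is stubbed out with a plain memcpy:

import core.stdc.string : memcpy;

// Placeholder: in practice this comes from runtime cache detection.
enum size_t largestCacheSize = 1024 * 1024;   // e.g. a 1 MB L2

void smartCopy(void[] dst, const(void)[] src)
{
    assert(dst.length == src.length);
    if (src.length <= largestCacheSize)
    {
        // Fits in cache: an ordinary copy is as good as anything.
        memcpy(dst.ptr, src.ptr, src.length);
    }
    else
    {
        // Bigger than the largest cache: use non-temporal (streaming)
        // stores so the copy doesn't evict everything else.
        streamingCopy(dst.ptr, src.ptr, src.length);
    }
}

// Stand-in for a real non-temporal copy loop (movntdq and friends);
// here it just forwards to memcpy.
void streamingCopy(void* dst, const(void)* src, size_t n)
{
    memcpy(dst, src, n);
}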

> As a rule of thumb, it's generally good to use memcpy (and consequently 
> fill-by-copy) if you can — for large data sets, memcpy doesn't make much 
> difference, and for smaller data sets, it might be much faster. For 
> cheap-to-copy objects, Duff's Device might perform faster than a simple 
> for loop. Ultimately, all this is subject to your compiler's and 
> machine's whims and quirks.

This is why I think it's absolutely critical to have cache-size 
determination as a fundamental runtime library primitive. Memory speeds 
have only tripled since 1980, while processor speeds have improved by 
1000X, so it's become less and less acceptable for a systems language to 
be cache-unaware.
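
For illustration, the kind of primitive I mean is something along these 
lines (names and layout are purely illustrative, not a final API); the 
largestCacheSize placeholder in the sketch above would then be computed 
from the detected cache information:

// Description of one level of data cache, filled in at startup by
// CPUID-based detection (the detection itself is omitted here).
struct CacheInfo
{
    size_t size;          // total size in bytes
    uint   lineSize;      // cache line size in bytes
    uint   associativity; // ways of associativity
}

// The natural threshold for switching to non-temporal copies.
size_t largestCacheSize(const(CacheInfo)[] caches)
{
    size_t result = 0;
    foreach (c; caches)
        if (c.size > result)
            result = c.size;
    return result;
}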

> There is a very deep, and sad, realization underlying all this. We are 
> in 2001, the year of the Spatial Odyssey. We've done electronic 
> computing for more than 50 years now, and we strive to design more and 
> more complex systems, with unsatisfactory results. Software development 
> is messy. Could it be because the fundamental tools and means we use are 
> low-level, inefficient, and not standardized? Just step out of the box 
> and look at us — after 50 years, we're still not terribly good at 
> filling and copying memory.
> ================
> 
> Andrei

Summarizing everyone's comments -- nobody thinks that a fast large memcpy 
is terribly important, but a slow memcpy seems philosophically wrong.

I wanted to work on getting array operations to be cache-aware. Of 
course, the simplest possible array operation is a[]=b[].
On a Core2 with 1 MB/core of L2 cache, DMD's memcpy performance drops to 
0.6 bytes/cycle once you fall out of the L2 cache; but a 64-bit CPU can do 
8 bytes/cycle under optimal conditions, when the data is coming from the 
L1 cache. So I figured it doesn't make sense to optimize the more complex 
operations when the trivial one has so much potential for improvement.
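
The measurement behind those numbers is nothing fancy; a rough sketch 
(not my actual benchmark harness) that times a[]=b[] at sizes inside and 
outside the L2 cache looks like this:

import std.stdio;
import core.stdc.time : clock, CLOCKS_PER_SEC;

void main()
{
    // One size well inside a 1 MB L2, one well outside it.
    foreach (size; [32 * 1024, 8 * 1024 * 1024])
    {
        auto a = new ubyte[size];
        auto b = new ubyte[size];
        // Copy roughly 256 MB in total so clock() resolution isn't an issue.
        int reps = (256 * 1024 * 1024) / size;

        auto start = clock();
        foreach (i; 0 .. reps)
            a[] = b[];                     // the trivial array operation
        double seconds = cast(double)(clock() - start) / CLOCKS_PER_SEC;

        double mbPerSec = (cast(double)size * reps) / (1024.0 * 1024.0) / seconds;
        writefln("%8d bytes: %10.1f MB/s", size, mbPerSec);
    }
}

Dividing the measured throughput by the clock frequency gives the 
bytes-per-cycle figures quoted above.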




