Array append performance

Sun Aug 24 21:47:20 PDT 2008

On Sun, 24 Aug 2008 22:14:36 -0400, bearophile <bearophileHUGS at lycos.com>  
wrote:
> And, why the stride? I presume it's useful if you want a 2D matrix.

Yes, stride is a key enabler for integrating 2D/3D/nD matrices with  
built-in arrays.
Other pros:
- Fast O(1) reversing of an array. Looping through an array in reverse was  
a common enough operation to create an additional keyword and language  
feature (foreach_reverse) specifically for it.
- Slicing out a struct element. i.e. slice the red channel out of an image  
or the first name from an Employee array.
- Accessing every Nth element. This is a common MATLAB operation and often  
used in visualization (i.e. a quick and simple resolution adjustment for a  
line plot). Also, some parallel algorithms benefit from interleaving data.  
This is a fairly standard multi-threading technique, particularly when  
processing data[x] may take longer than processing data[y] (though it's  
not the first thing I'd try).

And lastly, SSE works on 16-byte structs, so why not use the extra 4-bytes?

> It may be useful, but maybe it's overkill, so you may want to benchmark  
> just (ptr, length, capacity) too.

For a 12-byte struct, either {ptr, length, capacity} or {ptr, length,  
stride}, the performance decrease for -release -o -inline was -0.7%, 0.3%  
and 0.2% for 1, 2 and 3 arrays respectively and -0%, 8% and 0.8%  
respectively for debug.

> Another useful benchmark is to test a program that manages tons of short  
> strings too, like this one:  
> http://en.wikipedia.org/wiki/D_language#Example_2

Okay, looking just at the main loop (writefln is really slow) the  
performance cost of the extra 4-bytes for a capacity field is about 1.7%  
(naive cost, i.e no SSE, etc). I didn't actually implement a capacity  
field, so this is a worst case.

> You may want to address some of the things superdan has said.

Superdan raised the good point that if the GC is aware of array length at  
creation, it might compact the capacity to allocate something else. To the  
best of my knowledge (which isn't saying much) both the Phobos and Tango  
GC are block base, and so are un-aware of the array length. Also, Sean  
Kelly wrote the Tango GC and isn't raising any red flags, which is  
probably a good sign it's a valid optimization.