Optimize my code =)

Robin robbepop at web.de
Sat Feb 15 15:06:25 PST 2014


Hiho,

wow, first of all: this community is awesome - so many kind and 
interesting answers. =)

With your help I was able to achieve a performance boost for 
several different operations.

Some benchmarks:

Allocation of five 10,000 x 10,000 matrices in a row:
Before: ~8.2 seconds
After: ~2.3 seconds (with minimallyInitializedArray!)

Multiplication of two 1000 x 1000 matrices:
Before: ~14.8 seconds
After: ~4.3 seconds
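For reference, the allocation speed-up comes from replacing a plain `new T[]` allocation with std.array.minimallyInitializedArray - a minimal stand-alone sketch (not my actual Matrix code):

```d
import std.array : minimallyInitializedArray;

void main() {
    // Skips default-initializing every element (no flood of
    // double.init writes across 100 million doubles); safe as
    // long as we fill the array before reading from it.
    auto data = minimallyInitializedArray!(double[])(10_000 * 10_000);
    assert(data.length == 100_000_000);
}
```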

However, I think there is still a lot of potential for further 
optimization. Let me summarize the main changes I have made to 
the code so far ...

Matrix is still a class, but I changed it to a final class, 
preventing its methods from being virtual. Dimension is now a 
final struct (I don't know whether 'final' affects structs in 
any way, though ...). This alone gave the multiplication a huge 
performance boost.
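Roughly, the change looks like this (a simplified sketch; the field names are made up, not my real code):

```d
// `final` on the class means none of its methods can be
// overridden, so the compiler can call them directly instead of
// through the vtable (and is free to inline them).
final class Matrix(T) {
    private T[] data;
    private size_t nRows, nCols;

    this(size_t rows, size_t cols) {
        nRows = rows;
        nCols = cols;
        data = new T[](rows * cols);
    }

    // Direct (non-virtual) call thanks to the final class.
    T opIndex(size_t row, size_t col) const {
        return data[row * nCols + col];
    }
}

void main() {
    auto m = new Matrix!double(2, 2);
    assert(m.nRows == 2 && m.nCols == 2);
}
```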

When I converted Matrix from a class to a struct, the 
multiplication time even dropped from ~4.3 seconds to about 3.6 
seconds. However, I am currently not sure whether I want 
matrices to be structs (value types).

Besides that, I tried to add nothrow and pure to every possible 
method. As it turned out, however, I was unable to add pure to 
most methods, because each of them called some impure method (as 
reported by the compiler). This actually made sense most of the 
time, and I think the only pure methods now are the constructors 
of the Dimension struct.
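For illustration, here is roughly what the pure parts of Dimension look like (a sketch of my struct, including the offset helper used further below - not the complete code):

```d
struct Dimension {
    size_t rows, cols;

    // Only touches its own fields, so it can be pure and nothrow.
    this(size_t rows, size_t cols) pure nothrow {
        this.rows = rows;
        this.cols = cols;
    }

    // Computing the flat row-major index is pure as well.
    size_t offset(size_t row, size_t col) const pure nothrow {
        return row * cols + col;
    }
}

void main() {
    auto d = Dimension(3, 4);
    assert(d.offset(1, 2) == 6); // 1 * 4 + 2
}
```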

What may still speed up things?

In my tests it turned out that the simple code:

auto m1 = Matrix!double.random(1000, 1000, -10, 10, 0.25);
auto m2 = Matrix!double.random(1000, 1000, -10, 10, 0.25);
auto m3 = m1 * m2;

called the normal copy constructor. This is a pity, as it would 
be a huge performance boost if it made use of move semantics. 
(In the C++ matrix code this scenario would actually invoke the 
move assignment operator for matrix m3, which is much, much 
faster than copying.)

But I haven't figured out yet how to use move semantics in D 
with class objects. Or is that only possible with struct value 
types?
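From what I have read so far, C++-style moves in D apply to structs, e.g. via std.algorithm's move; so if Matrix were a struct, the payload could be transferred without copying the elements. A small sketch of the struct case (Buffer is a made-up stand-in, not my Matrix):

```d
import std.algorithm : move;

struct Buffer {
    double[] data;
}

void main() {
    auto a = Buffer(new double[](1_000_000));
    // move transfers the payload without invoking a postblit;
    // for a slice this is just a pointer+length transfer, not an
    // element-wise copy of a million doubles.
    auto b = move(a);
    assert(b.data.length == 1_000_000);
}
```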

I have also tried the LDC compiler. However, it gave me some 
strange errors. For example, it didn't like the following line:

ref Matrix transpose() const {
	return new Matrix(this).transposeAssign();
}

And it forced me to change it to:

ref Matrix transpose() const {
	auto m = new Matrix(this);
	m.transposeAssign();
	return m;
}

which is kind of ugly ...
I hope this gets fixed soon, as I assume it is correct D code, 
given that DMD compiles it without complaint.

Some of you suggested that transposing the matrix before the 
multiplication must make it slower than multiplying without the 
transposition. In my former Java and C++ programs I already 
tested which strategy is faster, and it turned out that cache 
efficiency DOES matter, of course. There are also papers 
explaining why a transposed matrix multiplication can be faster 
than one without transposing.
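The strategy I mean, sketched on flat row-major arrays (a simplified stand-alone version, not my actual Matrix code):

```d
// Multiply an (m x n) matrix a by an (n x p) matrix b, both stored
// row-major in flat arrays, using a transposed copy of b so that
// BOTH inner-loop reads walk memory sequentially.
double[] mulTransposed(const(double)[] a, const(double)[] b,
                       size_t m, size_t n, size_t p)
{
    // Transpose b once: bt[j * n + k] == b[k * p + j].
    auto bt = new double[](n * p);
    foreach (k; 0 .. n)
        foreach (j; 0 .. p)
            bt[j * n + k] = b[k * p + j];

    auto c = new double[](m * p);
    foreach (i; 0 .. m)
        foreach (j; 0 .. p)
        {
            double sum = 0;
            foreach (k; 0 .. n) // sequential in both a and bt
                sum += a[i * n + k] * bt[j * n + k];
            c[i * p + j] = sum;
        }
    return c;
}

void main() {
    // [1 2; 3 4] * [5 6; 7 8] == [19 22; 43 50]
    auto c = mulTransposed([1.0, 2, 3, 4], [5.0, 6, 7, 8], 2, 2, 2);
    assert(c == [19.0, 22, 43, 50]);
}
```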
In the end this is just a test suite, and there are already even 
faster approaches to matrix multiplication, such as Strassen's 
algorithm, which runs in O(n^2.8) instead of the O(n^3) of my 
current simple solution. And with the power of OpenCL (or 
similar) one could even lift the matrix multiplication to the 
GPU and boom.^^ But the current task is to find generally 
optimized code for the D language - this thread has already 
helped me a lot!

I wasn't aware that D is actually capable of lazy evaluation 
beyond the pseudo-lazy short-circuiting of if conditions, which 
is kind of cool. However, I still have to read up on it in order 
to maybe find a use for it in this code.
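If I understand correctly, the feature in question is the `lazy` storage class, which delays evaluating an argument until the parameter is actually used - a tiny sketch:

```d
int calls = 0;

int expensive() {
    ++calls;
    return 42;
}

// `lazy` turns the argument into a hidden delegate that is only
// evaluated when (and each time) the parameter is read.
int pick(bool flag, lazy int value) {
    return flag ? value : 0;
}

void main() {
    assert(pick(false, expensive()) == 0);
    assert(calls == 0); // expensive() was never evaluated
    assert(pick(true, expensive()) == 42);
    assert(calls == 1);
}
```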

SIMD instructions also sound extremely cool, but a little too 
complex given that I am fairly new to D and still learning the 
basics of the language.

Finally, I want to mention something I found very irritating. 
Bearophile correctly pointed out that my pre-conditions for the 
index operators were all wrong and corrected my code with some 
smooth additions:

T opIndex(in size_t row, in size_t col) const nothrow
in {
     assert(row < nRows);
     assert(col < nCols);
} body {
     return data[dim.offset(row, col)];
}

The in and body blocks are cool as far as I understand their 
purpose. In the benchmarks, however, they had a clearly 
noticeable negative impact on the matrix multiplication, which 
rose from ~4 seconds with my code to ~8 seconds with his. When I 
left out the in and body blocks and used a single normal block 
as follows, the task stayed at the 4-second mark, so I suspect 
the compiler won't optimize code in an 'in' block - is that 
right?

T opIndex(in size_t row, in size_t col) const nothrow {
     assert(row < nRows);
     assert(col < nCols);
     return data[dim.offset(row, col)];
}

Thanks again for all your helpful comments, and thanks in 
advance - I am eagerly looking forward to your future ones! =)

Robin


More information about the Digitalmars-d-learn mailing list