Optimize my code =)
Robin
robbepop at web.de
Mon Feb 17 13:56:17 PST 2014
Hiho,
thank you for your code improvements and suggestions.
I really like the foreach loop in D, as well as the slight (but
measurable) performance boost it gives over conventional for loops. =)
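Just to show what I was comparing (a tiny standalone sketch, not the
actual benchmark code from the paste):

double sumForeach(const double[] values) pure nothrow {
    double sum = 0.0;
    foreach (v; values)                    // iterate directly over the elements
        sum += v;
    return sum;
}

double sumFor(const double[] values) pure nothrow {
    double sum = 0.0;
    for (size_t i = 0; i < values.length; ++i)
        sum += values[i];
    return sum;
}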
Another success of the changes I made is that I managed to improve
the matrix multiplication performance further, from 3.6 seconds for
two 1000x1000 matrices down to 1.9 seconds, which is already very
close to Java and C++ at about 1.3 - 1.5 seconds.
The key to victory was pointer arithmetic, as I noticed that I had
used it in the C++ implementation, too. xD
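Roughly, the inner loop now works like this (a simplified sketch with
made-up names: 'a' is the left matrix and 'bT' the already-transposed
right one, both square and stored row-major in a plain double[]; the
real code is in the paste linked below):

void multiplyTransposed(const double[] a, const double[] bT,
                        double[] result, size_t n) pure nothrow
{
    foreach (immutable r; 0 .. n) {
        const(double)* rowA = &a[r * n];        // start of row r of a
        foreach (immutable c; 0 .. n) {
            const(double)* rowB = &bT[c * n];   // start of row c of bT (= column c of b)
            double sum = 0.0;
            foreach (immutable k; 0 .. n)
                sum += rowA[k] * rowB[k];       // raw pointer indexing, no slice bounds checks
            result[r * n + c] = sum;
        }
    }
}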
The toString implementation has also improved slightly thanks to the
changes you mentioned: 1.37 secs -> 1.29 secs.
I have also adjusted all operator overloads to the "new style"
- I just hadn't known about that "new style" until now - thanks!
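For anyone reading along who, like me, hadn't seen it: the "new style"
means templated opBinary / opOpAssign instead of the old opAdd /
opAddAssign names. A tiny standalone example (invented Vec type, not
my actual Matrix code):

struct Vec {
    double[] data;

    // new style: one templated opBinary replaces opAdd, opSub, ...
    Vec opBinary(string op)(const Vec other) const
        if (op == "+" || op == "-")
    {
        assert(data.length == other.data.length);
        auto result = Vec(data.dup);
        mixin("result.data[] " ~ op ~ "= other.data[];");
        return result;
    }

    // new style: one templated opOpAssign replaces opAddAssign, opSubAssign, ...
    ref Vec opOpAssign(string op)(const Vec other)
        if (op == "+" || op == "-")
    {
        mixin("data[] " ~ op ~ "= other.data[];");
        return this;
    }
}

unittest {
    auto a = Vec([1.0, 2.0]);
    auto b = Vec([3.0, 4.0]);
    auto c = a + b;   // calls a.opBinary!"+"(b)
    a -= b;           // calls a.opOpAssign!"-"(b)
    assert(c.data == [4.0, 6.0] && a.data == [-2.0, -2.0]);
}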
I will just post the whole code again so that you can see what I
have changed.
Keep in mind that I am still using DMD as the compiler, so performance
may rise further once I switch to another compiler!
All in all I am very happy with the code analysis and its
improvements!
However, there were some strange things that left me very
confused ...
void allocationTest() {
    writeln("allocationTest ...");
    sw.start();
    auto m1 = Matrix!double(10000, 10000);
    { auto m2 = Matrix!double(10000, 10000); }
    { auto m2 = Matrix!double(10000, 10000); }
    { auto m2 = Matrix!double(10000, 10000); }
    //{ auto m2 = Matrix!double(10000, 10000); }
    sw.stop();
    printBenchmarks();
}
This is the most confusing code snippet. I just changed the allocation
of all m1 and m2 from new Matrix!double (on the heap) to Matrix!double
(on the stack), and the performance dropped significantly - the
benchmarked time rose from 2.3 seconds to over 25 seconds!! Now look
at the code above. When I leave it as it is now, the code needs about
2.9 seconds of runtime; however, when I enable the currently
commented-out line, the code takes 14 to 25 seconds longer! Mind
blown ... 0.o This is extremely confusing, because I allocate these
matrices on the stack, and since each is allocated within its own
scoped block, they should release their memory again immediately, so
that no more than two matrices consume memory at the same time. That
just wasn't the case as far as I could tell from my tests.
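Just to make explicit which two variants I am comparing, here is a
stripped-down sketch (assuming the element storage is a plain
GC-allocated array; the real Matrix is in the paste):

struct Matrix(T) {
    T[] data;   // element storage - lives on the GC heap in both variants
    this(size_t rows, size_t cols) { data = new T[rows * cols]; }
}

void example() {
    auto onHeap  = new Matrix!double(100, 100); // the Matrix struct itself on the GC heap
    auto onStack = Matrix!double(100, 100);     // the Matrix struct on the stack,
                                                // but its data array is still heap memory
}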
Another strange thing was that the new opEquals implementation:
bool opEquals(const ref Matrix other) const pure nothrow {
    if (this.dim != other.dim) {
        return false;
    }
    foreach (immutable i; 0 .. this.dim.size) {
        if (this.data[i] != other.data[i]) return false;
    }
    return true;
}
is actually about 20% faster than the one you suggested with the
single line "return (this.dim == other.dim && this.data[] ==
other.data[]);".
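Written out as a full method (pure/nothrow omitted here), the
suggested version I benchmarked against is just the same signature
with that one-line body:

bool opEquals(const ref Matrix other) const {
    return this.dim == other.dim && this.data[] == other.data[];
}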
The last thing I haven't quite understood: I tried to replace

    auto t = Matrix(other).transposeAssign();

in the matrix multiplication algorithm with its shorter and clearer
form

    auto t = other.transpose(); // sorry for the nasty '()', but I like them! :/

However, this gave me wonderful segmentation faults at runtime during
the matrix multiplication ...
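For context, what I would expect such a transpose() to boil down to is
roughly this (my illustration only - the actual implementation is in
the paste):

// Hypothetical shape of a non-mutating transpose() built on transposeAssign():
Matrix transpose() const {
    auto result = Matrix(this);   // copy, then transpose the copy in place
    result.transposeAssign();
    return result;
}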
And here is the complete and improved code:
http://dpaste.dzfl.pl/7f8610efa82b
Thanks in advance for helping me! =)
Robin