Optimize my code =)
Robin
robbepop at web.de
Tue Feb 18 15:36:11 PST 2014
Hiho,
I am happy to see that I could encourage so many people to
discuss this topic - you not only gave me interesting answers
to my questions but also analysed and evaluated features and
performance of D.
I have fixed the allocation performance problem via a custom
destructor which manually notifies the GC that its data is
garbage, so that the GC can free it in the next cycle and no
longer has to determine whether it is still in use. This sped up
the allocation tests enormously and also improved overall runtime
slightly, because memory is now freed contiguously.
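For reference, the destructor trick looks roughly like this (a minimal sketch with made-up field names, assuming the buffer is GC-allocated):

```d
import core.memory : GC;

struct Matrix
{
    double[] data;

    this(size_t rows, size_t cols)
    {
        data = new double[](rows * cols);
    }

    // Tell the GC the buffer is garbage right away, so a collection
    // cycle no longer has to discover that on its own.
    ~this()
    {
        if (data !is null)
        {
            GC.free(data.ptr);
            data = null;
        }
    }
}
```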
@Nick Sabalausky:
Why should I remove .dup from the copy constructor? In my opinion
it is important to keep both matrices (source and copy)
independent from each other. The suggested COW feature for
matrices sounds interesting but also weird; I have to read more
about that.
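The copy I mean is the usual postblit with .dup (sketched here with a made-up field name, not my exact code):

```d
struct Matrix
{
    double[] data;

    // Postblit: after the bitwise copy, duplicate the heap buffer so
    // that source and copy no longer share storage.
    this(this)
    {
        data = data.dup;
    }
}
```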
Move semantics in D are needed and do exist; however, I think
the D compiler isn't as good as C++ compilers at determining
what is movable and what is not.
Another thing, which is hopefully a bug in the current language
implementation of D, is the strange behaviour of the transpose
method, whose only line of code is:
return Matrix(this).transposeAssign();
In C++, for example, this compiles without any problems and
yields correctly transposed matrix copies. This works thanks to
the existence of move semantics in C++, one of the coolest
features since C++11, which has simplified code enormously in
many cases for value types such as D's structs.
I ran several tests to see when the move assignment operator or
the move constructor is called in D, and whenever I expected
them to be called, the copy constructor or the postblit was
called instead, which was frustrating imo.
Perhaps I still haven't quite understood what happens behind the
scenes in D, but it can't be the solution to always copy the
whole data when I could instead move it on the heap.
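A workaround that should avoid the extra copy is to split the one-liner into a named local and rely on NRVO for the return (just a sketch, assuming the Matrix(this) copy constructor and transposeAssign from my code):

```d
Matrix transpose() const
{
    // explicit copy via the copy constructor, then transpose in place
    Matrix result = Matrix(this);
    result.transposeAssign();
    // named return value: the compiler can elide the copy here (NRVO)
    return result;
}
```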
Besides that, one suggestion which came up was that I could merge
the Dimension module into the Matrix module, as they work
together semantically. However, I find it better to have
many smaller code snippets instead of fewer bigger ones, and
that's why I keep them separated.
I also gave scoped imports a try and hoped that they would
reduce the size of my executable and perhaps improve the
performance of my program; neither was true -> confused. Instead
I now have more lines of code and can no longer see at a glance
which dependencies the module as a whole has. So what is the
point of scoped imports?
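To make clear what I tried, a scoped import looks like this (hypothetical module, not my actual code):

```d
module matrix;

// module-level import: visible to the whole module
import std.conv : to;

string describe(size_t rows, size_t cols)
{
    // scoped import: only visible inside this function, and only
    // pulled in when the function itself is actually compiled/used
    import std.format : format;
    return format("%sx%s matrix", rows, cols);
}
```

My understanding so far is that the benefit is mainly laziness and hygiene - in templates the import is only needed on instantiation, and the module namespace stays clean - rather than any binary-size or speed gain.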
The mixin keyword is also nice to have, but it feels similar to a
dirty C macro to me, where text replacement takes place at a
lower level of abstraction (unlike templates). Of course I am
wrong and you will teach me why, but atm I have strange feelings
about implementing code with mixins. In this special case:
perhaps it isn't a wise decision to merge addition with
subtraction, and perhaps I can find faster approaches in which
the two operations differ more, which would require splitting the
methods up again. (Theoretical, but it could be.)
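The merged addition/subtraction I am talking about is a string mixin along these lines (a sketch with made-up field names):

```d
struct Matrix
{
    double[] data;

    // One body for both "+" and "-": the operator token is spliced in
    // as a compile-time string, but the result is still fully
    // type-checked in this context (unlike a C macro's blind text
    // replacement).
    Matrix opBinary(string op)(ref const Matrix rhs) const
        if (op == "+" || op == "-")
    {
        Matrix result;
        result.data = new double[](data.length);
        foreach (i; 0 .. data.length)
            result.data[i] = mixin("data[i] " ~ op ~ " rhs.data[i]");
        return result;
    }
}
```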
Another weird thing is that result ~= text(tabStr, this[r, c])
in the toString method is much slower than the two following
lines of code:
result ~= tabStr;
result ~= to!string(this[r, c]);
Does anybody have an answer to this?
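For completeness, here are the two variants side by side; one plausible explanation (just my guess) is that text() first builds a throwaway temporary string from all its arguments:

```d
import std.conv : text, to;

void appendCell(ref string result, string tabStr, double value)
{
    // slow: text() concatenates its arguments into a fresh temporary
    // string, which is then appended and discarded
    // result ~= text(tabStr, value);

    // faster: append the existing string directly, then convert
    // only the number
    result ~= tabStr;
    result ~= to!string(value);
}
```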
I now think that I know the most important things about how to
write ordinarily (not super) efficient D code - laugh at me if
you want :D - and I will continue extending this benchmarking
library; whenever I feel bad about a certain algorithm's
performance I will knock on this thread again, you know. :P
At the end of my post I just want to summarize the benchmark
history for the matrix multiplication, as I think it is funny:
(all tests for two 1000x1000 matrices!)
- The bloody first matrix implementation (which worked) required
about 39 seconds to finish.
- Then I finally found out the optimization flags for DMD, and
the multiplication performance roughly doubled, to about 14
seconds.
- I created an account here, and thanks to your massive help the
matrix multiplication shortly afterwards required only about 5
seconds, due to better const, pure and nothrow usage.
- Through the shift from class to struct, some optimizations in
memory layout, further usage improvements of const, pure and
nothrow, as well as several array features and foreach loops, the
algorithm's performance rose once again and required about 3.7
seconds.
- The last notable optimization was the implementation based on
pointer arithmetic, which again improved the performance from 3.7
seconds to roughly 2 seconds.
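The pointer-based inner loop from that last step looks roughly like this (a sketch for row-major n x n matrices, not my exact code):

```d
// C = A * B for row-major n x n matrices, walking raw pointers so the
// inner loop avoids repeated index arithmetic and bounds checks.
void multiply(const(double)* a, const(double)* b, double* c, size_t n)
{
    foreach (i; 0 .. n)
    {
        foreach (j; 0 .. n)
        {
            double sum = 0.0;
            const(double)* ap = a + i * n; // start of row i of A
            const(double)* bp = b + j;     // top of column j of B
            foreach (k; 0 .. n)
            {
                sum += *ap++ * *bp;
                bp += n;                   // step down column j
            }
            c[i * n + j] = sum;
        }
    }
}
```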
Given this development, I see that there is still a chance that
we could beat the 1.5 seconds from Java or even the 1.3 seconds
from C++! (Based on my machine: 64-bit Arch Linux, dual-core
2.2 GHz, 4 GB RAM.)
There are still many ways to further improve the performance, for
example by using LDC on certain hardware, by parallelism, and
perhaps by implementing COW with no GC dependencies. And of
course I may be missing many other possible optimization features
of D.
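For the parallelism idea, std.parallelism should make it almost mechanical - a sketch (untested, flat row-major arrays assumed):

```d
import std.parallelism : parallel;
import std.range : iota;

// Distribute the rows of the result matrix across all cores.
void multiplyParallel(const double[] a, const double[] b,
                      double[] c, size_t n)
{
    foreach (i; parallel(iota(n)))
    {
        foreach (j; 0 .. n)
        {
            double sum = 0.0;
            foreach (k; 0 .. n)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```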
I for my part can say that I have learned a lot, and that's the
most important thing above everything else for me here.
Thank you all for this very interesting conversation! You - the
community - are a part of what makes D a great language. =)
Robin
More information about the Digitalmars-d-learn
mailing list