Optimize my code =)

Tue Feb 18 16:04:01 PST 2014

Robin:

> the existence of move semantics in C++ and is one of the 
> coolest features since C++11 which increased and simplified 
> codes in many cases enormously for value types just as structs 
> in D.

I guess Andrei doesn't agree with you (and move semantics in 
C++11 is quite hard to understand).

> I also gave scoped imports a try and hoped that they were able 
> to reduce my executable file and perhaps increase the 
> performance of my program, none of which was true -> confused. 
> Instead I now have more lines of code and do not see instantly 
> what dependencies the module as itself has. So what is the 
> point in scoped imports?

Scoped imports in general can't increase performance. Their main 
point is to avoid importing modules that are needed only by 
templated code. So if you don't instantiate the template, the 
linker has less work to do and the binary is usually smaller (no 
moduleinfo, etc).
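
A minimal sketch of the idea, assuming a hypothetical helper named 
describe (it is not from your code). The import sits inside the 
template, so it only comes into play if the template is instantiated:

```d
// Hypothetical helper; the name and signature are illustrative only.
string describe(T)(T value)
{
    // Scoped import: std.conv is needed only by this template body,
    // so it is pulled in only when describe!T is actually instantiated.
    import std.conv : to;
    return "value = " ~ to!string(value);
}

void main()
{
    // Instantiating describe!int here is what brings std.conv in.
    assert(describe(42) == "value = 42");
}
```

If main never called describe, the module would not depend on 
std.conv at all.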

> Another weird thing is that the result ~= text(tabStr, this[r, 
> c]) in the toString method is much slower than the two 
> following lines of code:
>
> result ~= tabStr;
> result ~= to!string(this[r, c]);
>
> Does anybody have an answer to this?

It doesn't look too weird. In the first case you are allocating 
and building larger temporary strings. But I don't think matrix 
printing is a bottleneck in a program.
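
A rough sketch of the difference, assuming a hypothetical 
rowToString function in place of your toString and an Appender 
instead of a plain string (both are my assumptions, not your code):

```d
import std.array : appender;
import std.conv : text, to;

// Hypothetical stand-in for the matrix toString; names are illustrative.
string rowToString(const int[] row, string tabStr)
{
    auto result = appender!string();
    foreach (x; row)
    {
        // text(tabStr, x) would build one temporary string holding both
        // pieces, which is then copied again into result. Appending the
        // two pieces separately keeps the temporary smaller.
        result.put(tabStr);
        result.put(to!string(x));
    }
    return result.data;
}

void main()
{
    assert(rowToString([1, 2], "\t") == "\t1\t2");
    // text concatenates the string forms of all its arguments.
    assert(text("\t", 1) == "\t1");
}
```

Whether the difference is measurable depends on the compiler and on 
how large the matrix is, so time it before relying on it.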

> - Then I have finally found out the optimizing commands for the 
> DMD

This is a small but common problem. Perhaps worth fixing.

> There are still many ways to further improve the performance. 
> For example by using LDC

Try the latest stable and unstable versions of LDC2:
https://github.com/ldc-developers/ldc/releases/tag/v0.12.1
https://github.com/ldc-developers/ldc/releases/tag/v0.13.0-alpha1

> on certain hardwares, parallelism and perhaps by implementing 
> COW with no GC dependencies. And of course I may miss many 
> other possible optimization features of D.

Matrix multiplication can be improved a lot by tiling the matrix 
(or better, by using a cache-oblivious algorithm), using SSE/AVX2, 
using multiple cores, etc. As a starting point you can try 
std.parallelism. It could speed up your code on 4 cores with a 
very limited amount of added code.
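
A minimal sketch of what I mean, assuming a row-major double[] 
layout and a hypothetical multiply function (your Matrix type 
surely looks different). Each output row is independent, so the 
outer loop can be given to the default task pool:

```d
import std.parallelism : parallel;
import std.range : iota;

// Hypothetical row-major multiply: c = a * b, where a is n x m and
// b is m x p. The layout and the signature are assumptions.
double[] multiply(const double[] a, const double[] b,
                  size_t n, size_t m, size_t p)
{
    auto c = new double[n * p];
    // Rows of c do not overlap, so they can be computed in parallel
    // without any locking.
    foreach (i; parallel(iota(n)))
    {
        foreach (j; 0 .. p)
        {
            double sum = 0.0;
            foreach (k; 0 .. m)
                sum += a[i * m + k] * b[k * p + j];
            c[i * p + j] = sum;
        }
    }
    return c;
}

void main()
{
    // [[1,2],[3,4]] * [[5,6],[7,8]] = [[19,22],[43,50]]
    auto c = multiply([1.0, 2, 3, 4], [5.0, 6, 7, 8], 2, 2, 2);
    assert(c == [19.0, 22, 43, 50]);
}
```

The only change from the serial version is the parallel(iota(n)) 
in the outer foreach. This does nothing for cache behaviour, so 
tiling is still worth trying afterwards.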

Bye,
bearophile