Optimize my code =)

Fri Feb 14 10:59:35 PST 2014

On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:
> As I am very new to D I instantly ran into certain optimizing 
> issues. E.g. the simple matrix multiplication based in my java 
> implementation requires about 1.5 secs for multiplying two 
> 1000x1000 matrices, however my D implementation requires 39 
> seconds without compiler optimizations and about 14 seconds 
> with -inline, -O, -noboundscheck and -release. So I am sure 
> that I just missed some possible optimization routines for D 
> and that I could miss entire patters used in D to write more 
> optimized code - especially when looking at the const and 
> immutability concepts.

Try with and without -noboundscheck. Sometimes using it seems to
make things slower, which is odd.

Also, compiling for 64-bit may help significantly. When you
compile for 32-bit, you won't get benefits of SIMD instructions
as it's not guaranteed to be supported by the CPU. Since Java
runs a JIT, it would use these SIMD instructions.

Also, you'll get significantly faster code if you use LDC or GDC.

> class Matrix(T = double) {
> 	private T[] data;
> 	private Dimension dim;
> }

This should be final class or struct, you don't need virtual
methods.

>
> This is my class with its templated data as a one dimensional 
> array (as I don't like jagged-arrays) and a dimension (which is 
> a struct). The dimension object has some util functionality 
> such as getting the total size or mapping (row, col) to a 
> one-dimensional index besides the getter for rows and columns.
>
> this(size_t rows, size_t cols) {
> 	this.dim = Dimension(rows, cols);
> 	this.data = new T[this.dim.size];
> 	enum nil = to!T(0);
> 	foreach(ref T element; this.data) element = nil;
> }

You don't need to!T for setting to 0, just use plain 0. Using
this to will probably result in overhead. Note that this may
evaluated every time you use 'nil', so using a constant will help.

> T opIndex(size_t row, size_t col) const {
> 	immutable size_t i = this.dim.offset(row, col);
> 	if (i >= this.dim.size) {
> 		// TODO - have to learn exception handling in D first. :P
> 	}
> 	return this.data[i];
> }

You're using -noboundscheck then manually implementing bounds
checks, and probably less efficiently than the compiler does.
Either change this to an assert, or remove it. In either case,
the compiler will do it's own built in bounds checking.

This call is especially expensive since you're doing entirely
virtual methods because you specified class instead of final
class or using a struct. None of your Matrix calls will be
inlined.

>
> Which calls:
>
> size_t size() @property const {
> 	return this.rows * this.cols;
> }
>
> I think and hope that this is getting optimized via inlining. =)
> This works similar for opIndexAssign.

Not unless you final class / struct.

>
> The function I am mainly benchmarking is the simple matrix 
> multiplication where one of the multiplied matrices is 
> tranposed first in order to improve cache hit ratio.
>
> Matrix opMul(ref const(Matrix) other) {
> 	if (this.dim.rows != other.dim.cols || this.dim.cols != 
> ther.dim.rows) {
> 		// TODO - still have to learn exception handling first ...
> 	}
> 	auto m = new Matrix(this.dim.rows, other.dim.cols);
> 	auto s = new Matrix(other).transposeAssign();
> 	size_t r1, r2, c;
> 	T sum;
> 	for (r1 = 0; r1 < this.dim.rows; ++r1) {
> 		for (r2 = 0; r2 < other.dim.rows; ++r2) {
> 			sum = to!T(0);
> 			for (c = 0; c < this.dim.cols; ++c) {
> 				sum += this[r1, c] * other[r2, c];
> 			}
> 			m[r1, r2] = sum;
> 		}
> 	}
> 	return m;
> }

These allocations may potentially hurt.
Also, again, the virtual calls here are a huge hit.
Using to!T(0) is also a waste of performance, as youc an just do
straight 0.

> I am aware that there are faster algorithms for matrix 
> multiplication but I am mainly learning how to write efficient 
> D code in general and not based on any algorithm.

Using SIMD instructions manually would of course be much, much,
faster, but I understand that it defeats the point somewhat and
is much more ugly.

>
> This is more or less the same algorithm I am using with java 
> and c++. Both require about 1.5 secs (after java warm-up) to 
> perform this task for two 1000x1000 matrices. D compiled with 
> DMD takes about 14 seconds with all (known-to-me) optimize flag 
> activated. (listed above)

If you used LDC or GDC 64-bit with the changes above, I'd guess
it would be similar. I wouldn't expect D to out-perform Java much
for this particular scenario, there's very little going on and
the JIT should be able to optimize it quite well with SIMD stuff.

>
> I wanted to make s an immutable matrix in order to hopefully 
> improve performance via this change, however I wasn't 
> technically able to do this to be honest.

I don't think this would improve performance.