SIMD on Windows

Sat Jun 29 11:32:52 PDT 2013

On Saturday, 29 June 2013 at 17:57:20 UTC, Jonathan Dunlap wrote:
> I've updated the project with your suggestions at 
> http://dpaste.dzfl.pl/fce2d93b but still get the same 
> performance. Vectors defined in the benchmark function body, no 
> function calling overhead, etc. See some of my comments below 
> btw:
>
>> First of all, calcSIMD and calcScalar are virtual functions so 
>> they can't be inlined, which prevents any further optimization.
>
> For the dlang docs: Member functions which are private or 
> package are never virtual, and hence cannot be overridden.
>
>> So my guess is that the first four multiplications and the 
>> second four multiplications in calcScalar are done in 
>> parallel. ... The reason it's faster is that gdc replaces 
>> multiplication by 2 with addition and omits multiplication by 
>> 1.
>
> I've changed the multiplies of 2 and 1 to 2.1 and 1.01 
> respectively. Still no performance difference between the two 
> for me.

The multipliers 2 and 1 were the reason why the scalar code 
performed a little better than the SIMD code when compiled with 
GDC. The main reason why the scalar code isn't much slower than 
the SIMD code is instruction-level parallelism. Because the first 
four operations in calcScalar are independent (none of them 
depends on the result of any of the other three), modern x86-64 
processors can execute them in parallel. Because of that, the 
speed of your program is limited by instruction latency and not 
by throughput. That's why it doesn't really make a difference 
that the scalar version does four times as many operations.
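
To illustrate the latency-versus-throughput point outside your benchmark, here is a minimal sketch. It is not taken from the dpaste code: the names oneChain and fourChains, the iteration count and the multiplier are all made up for the example. The first function has a single dependency chain, the second has four independent ones and therefore does four times as many multiplications.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Hypothetical example: a single dependency chain. Every multiply has
// to wait for the result of the previous one, so the loop is bound by
// the latency of the multiply instruction.
double oneChain(size_t n, double m)
{
    double a = 1.0;
    foreach (_; 0 .. n)
        a *= m;
    return a;
}

// Hypothetical example: four independent chains. This does four times
// as many multiplies per iteration, but no chain depends on another,
// so an out-of-order CPU is free to overlap them.
double fourChains(size_t n, double m)
{
    double a = 1.0, b = 1.0, c = 1.0, d = 1.0;
    foreach (_; 0 .. n)
    {
        a *= m;
        b *= m;
        c *= m;
        d *= m;
    }
    return a + b + c + d;
}

void main(string[] args)
{
    enum size_t n = 100_000_000;
    // Derived from a run-time value so the multiplier is not a
    // compile-time constant.
    immutable double m = 1.0 + 1e-9 * args.length;

    auto sw = StopWatch(AutoStart.yes);
    immutable r1 = oneChain(n, m);
    immutable t1 = sw.peek.total!"msecs";

    sw.reset();
    immutable r4 = fourChains(n, m);
    immutable t4 = sw.peek.total!"msecs";

    // Printing the results keeps the loops from being optimised away.
    writefln("one chain:   %s ms (result %s)", t1, r1);
    writefln("four chains: %s ms (result %s)", t4, r4);
}
```

If the reasoning above holds on your machine, I would expect the two timings to be fairly close in an optimised build, but the exact numbers depend on the compiler, the flags and the CPU, so treat it as an experiment rather than a guarantee.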

You can also take advantage of instruction-level parallelism when 
using SIMD. For example, I get about the same number of 
iterations per second for the following two functions (when using 
GDC), even though calcSIMD1 does four times as many 
multiplications as calcSIMD2:

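	import gcc.attribute;

	// s0..s3, i0..i3 and d0..d3 are the vector variables of the
	// benchmark (declared elsewhere).

	// Four independent dependency chains of two multiplies each.
	@attribute("forceinline") void calcSIMD1() {
		s0 = s0 * i0;
		s0 = s0 * d0;
		s1 = s1 * i1;
		s1 = s1 * d1;
		s2 = s2 * i2;
		s2 = s2 * d2;
		s3 = s3 * i3;
		s3 = s3 * d3;
	}

	// A single dependency chain of two multiplies.
	@attribute("forceinline") void calcSIMD2() {
		s0 = s0 * i0;
		s0 = s0 * d0;
	}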

By the way, if performance is very important to you, you should 
try GDC (or LDC, but I don't think LDC is currently fully usable 
on Windows).