SIMD on Windows
jerro
a at a.com
Sat Jun 29 11:32:52 PDT 2013
On Saturday, 29 June 2013 at 17:57:20 UTC, Jonathan Dunlap wrote:
> I've updated the project with your suggestions at
> http://dpaste.dzfl.pl/fce2d93b but still get the same
> performance. Vectors defined in the benchmark function body, no
> function calling overhead, etc. See some of my comments below
> btw:
>
>> First of all, calcSIMD and calcScalar are virtual functions so
>> they can't be inlined, which prevents any further optimization.
>
> For the dlang docs: Member functions which are private or
> package are never virtual, and hence cannot be overridden.
>
>> So my guess is that the first four multiplications and the
>> second four multiplications in calcScalar are done in
>> parallel. ... The reason it's faster is that gdc replaces
>> multiplication by 2 with addition and omits multiplication by
>> 1.
>
> I've changed the multiplies of 2 and 1 to 2.1 and 1.01
> respectively. Still no performance difference between the two
> for me.
The multiplications by 2 and 1 were the reason the scalar code
performed slightly better than the SIMD code when compiled with
GDC. The main reason the scalar code isn't much slower than the
SIMD code is instruction-level parallelism. Because the first four
operations in calcScalar are independent (none of them depends on
the result of any of the other three), modern x86-64 processors
can execute them in parallel. As a result, the speed of your
program is limited by instruction latency rather than throughput.
That's why it makes little difference that the scalar version
does four times as many operations.
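To make the latency-versus-throughput point concrete, here is a minimal scalar sketch. It is not taken from the benchmark in this thread: the function names, the factor and the step count are all made up for illustration, and it assumes a current Phobos with std.datetime.stopwatch. One function runs a single dependent chain of multiplications, the other runs four chains that never read each other's results.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// One chain: every multiplication needs the previous result, so each
// step has to wait for the multiply before it to finish.
float dependentChain(size_t steps, float factor)
{
    float a = 1.0f;
    foreach (_; 0 .. steps)
        a *= factor;
    return a;
}

// Four chains that never read each other's results. The CPU is free
// to overlap them, although each chain is still serial internally.
// The start values differ so the chains cannot be merged into one.
float[4] independentChains(size_t steps, float factor)
{
    float a = 1.0f, b = 2.0f, c = 3.0f, d = 4.0f;
    foreach (_; 0 .. steps)
    {
        a *= factor;
        b *= factor;
        c *= factor;
        d *= factor;
    }
    return [a, b, c, d];
}

void main()
{
    // Hypothetical values, chosen only so the results stay finite.
    enum size_t steps = 100_000_000;
    immutable float factor = 1.0000001f;

    auto sw = StopWatch(AutoStart.yes);
    immutable one = dependentChain(steps, factor);
    immutable tOne = sw.peek;

    sw.reset(); // restarts the measurement; the watch keeps running
    immutable four = independentChains(steps, factor);
    immutable tFour = sw.peek;

    writefln("1 chain : %s (result %s)", tOne, one);
    writefln("4 chains: %s (results %s)", tFour, four);
}
```

If the loop really is latency-bound, the two timings should come out close to each other even though the second function does four times as many multiplications. The actual numbers depend on the compiler and flags, and an optimizer may vectorize the second loop on its own, so treat the output as an indication only.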
You can also take advantage of instruction-level parallelism when
using SIMD. For example, I get about the same number of
iterations per second for the following two functions (when using
GDC), even though the first one does four times as much work:
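    import gcc.attribute;

    // Four independent chains: s0..s3 never depend on each other,
    // so the CPU can overlap them.
    @attribute("forceinline") void calcSIMD1() {
        s0 = s0 * i0;
        s0 = s0 * d0;
        s1 = s1 * i1;
        s1 = s1 * d1;
        s2 = s2 * i2;
        s2 = s2 * d2;
        s3 = s3 * i3;
        s3 = s3 * d3;
    }

    // A single chain: the second multiply waits for the first.
    @attribute("forceinline") void calcSIMD2() {
        s0 = s0 * i0;
        s0 = s0 * d0;
    }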
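The two functions above rely on globals (s0..s3, i0..i3, d0..d3) that are not shown here. A self-contained sketch of the same comparison could look like the following. It is only an approximation of the benchmark, not the original code: it uses float4 from core.simd (so it assumes a target where that type is available, such as x86-64), the names oneChain and fourChains are made up, all chains share one pair of factors instead of i0..i3 and d0..d3, and the values 4 and 0.25 are hypothetical, picked so that every multiplication is exact.

```d
import core.simd : float4;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Same shape as calcSIMD2: one vector, two dependent multiplies per step.
float4 oneChain(float4 s, float4 i, float4 d, size_t steps)
{
    foreach (_; 0 .. steps)
    {
        s = s * i;
        s = s * d;
    }
    return s;
}

// Same shape as calcSIMD1: four vectors that never depend on each other.
// The sum is returned only so that all four results are actually used.
float4 fourChains(float4 s0, float4 s1, float4 s2, float4 s3,
                  float4 i, float4 d, size_t steps)
{
    foreach (_; 0 .. steps)
    {
        s0 = s0 * i; s0 = s0 * d;
        s1 = s1 * i; s1 = s1 * d;
        s2 = s2 * i; s2 = s2 * d;
        s3 = s3 * i; s3 = s3 * d;
    }
    return s0 + s1 + s2 + s3;
}

void main()
{
    // Hypothetical inputs: 4 * 0.25 == 1, so values neither grow nor shrink.
    enum size_t steps = 100_000_000;
    float4 i = [4.0f, 4.0f, 4.0f, 4.0f];
    float4 d = [0.25f, 0.25f, 0.25f, 0.25f];
    float4 a = [1.0f, 2.0f, 3.0f, 4.0f];
    float4 b = [5.0f, 6.0f, 7.0f, 8.0f];
    float4 c = [9.0f, 10.0f, 11.0f, 12.0f];
    float4 e = [13.0f, 14.0f, 15.0f, 16.0f];

    auto sw = StopWatch(AutoStart.yes);
    float4 r1 = oneChain(a, i, d, steps);
    immutable t1 = sw.peek;

    sw.reset(); // restarts the measurement; the watch keeps running
    float4 r4 = fourChains(a, b, c, e, i, d, steps);
    immutable t4 = sw.peek;

    writefln("1 vector : %s (result %s)", t1, r1.array);
    writefln("4 vectors: %s (result %s)", t4, r4.array);
}
```

If both timings come out close, the loop is limited by the latency of the dependent multiplies rather than by how many multiplies the CPU can issue, which is the same effect described above for the scalar version.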
By the way, if performance is very important to you, you should
try GDC (or LDC, but I don't think LDC is currently fully usable
on Windows).