core.simd woes

Mon Aug 6 18:24:07 PDT 2012

On Monday, 6 August 2012 at 15:15:30 UTC, Manu wrote:
> I think core.simd is only designed for the lowest level of access to the
> SIMD hardware. I started writing std.simd some time back; it is mostly
> finished in a fork, but there are some bugs/missing features in D's SIMD
> support preventing me from finishing/releasing it. (incomplete dmd
> implementation, missing intrinsics, no SIMD literals, can't do unit
> testing, etc)

Yes, I found, and have been referring to, your std.simd library 
for a while now. Even with your library having GDC-only support 
at the moment, it's been a help. Thank you.

> The intention was that std.simd would be flat C-style api, which would be
> the lowest level required for practical and portable use.
> It's almost done, and it should make it a lot easier for people to build
> their own SIMD libraries on top. It supplies most useful linear algebraic
> operations, and implements them as efficiently as possible for other
> architectures than just SSE.
> Take a look: https://github.com/TurkeyMan/phobos/blob/master/std/simd.d

Right now I'm working with DMD on Linux x86_64. LDC doesn't 
support SIMD right now, and I haven't built GDC yet, so I can't 
do performance comparisons between the two. I really need to get 
around to setting up GDC, because I've always planned on using 
that as a "release compiler" for my code.

The problem is, as I mentioned above, that performance of SIMD 
completely gets shot when wrapping a float4 into a struct, 
rather than using float4 directly. There are some places 
(like matrices) where they do make a big impact, but I'm trying 
to find the best solution for general code. For instance, my 
current math library looks like:

     struct Vector4 { float x, y, z, w; ... }
     struct Matrix4 { Vector4 x, y, z, w; ... }

but I was planning on changing over to (something like):

     alias float4 Vector4;
     alias float4[4] Matrix4;

So I could use the types directly and reap the performance gains. 
I'm currently doing this to both my D code (still in an early 
state), and our C# code for Mono. Both core.simd and Mono.Simd 
have "compiler magic" vector types, but Mono's version gives me 
access to component channels and simple constructors I can use, 
so for user code (and types like the Matrix above, with internal 
vectors) it's very convenient and natural. D's simply isn't, and 
I'm not sure there's any way around it since again, at least 
with DMD, performance is shot when I put it in a struct.
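
A minimal sketch of the closest thing available without a wrapper 
struct: free functions called through UFCS on the bare float4. The 
names x and setX are hypothetical, not part of core.simd, and this 
assumes a compiler with float4 support on the target. It still does 
not allow vec.x += scalar, and it reads a single lane as a scalar.

     import core.simd;

     alias float4 Vector4;

     // Hypothetical accessor: reads lane 0 as a plain float.
     @property float x(float4 v) { return v.array[0]; }

     // Hypothetical setter: writes lane 0 in place.
     void setX(ref float4 v, float val) { v.array[0] = val; }

     void main()
     {
         Vector4 v = [1, 2, 3, 4];
         assert(v.x == 1);                  // UFCS: same as x(v)
         v.setX(9);                         // UFCS: same as setX(v, 9)
         assert(v.array == [9f, 2, 3, 4]);
     }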

> On a side note, your example where you're performing a scalar add within a
> vector; this is bad, don't ever do this.
> SSE (ie, x86) is the most tolerant architecture in this regard, but it's
> VERY bad SIMD design. You should never perform any component-wise
> arithmetic when working with SIMD; It's absolutely not portable.
> Basically, a good rule of thumb is, if the keyword 'float' appears anywhere
> that interacts with your SIMD code, you are likely to see worse performance
> than just using float[4] on most architectures.
> Better to factor your code to eliminate any scalar work, and make sure
> 'scalars' are broadcast across all 4 components and continue doing 4d
> operations.
>
> Instead of: @property pure nothrow float x(float4 v) { return v.ptr[0]; }
> Better to use: @property pure nothrow float4 x(float4 v) { return
> swizzle!"xxxx"(v); }

Thanks a lot for telling me this, I don't know much about SIMD 
stuff. You're actually the exact person I wanted to talk to, 
because you do know a lot about this and I've always respected 
your opinions.

I'm not opposed to doing something like:

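     void addX(ref float4 v, float val)
     {
         // zero every lane first (D floats default-initialize to NaN)
         float4 f = [0, 0, 0, 0];
         f.ptr[0] = val; // float4 has no .x member, so index lane 0
         v += f;         // one 4-wide add; v is updated in place
     }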

to do single component scalars, but it's very inconvenient for 
users to remember to use:

     vec.addX(scalar);

instead of:

     vec.x += scalar;

But that wouldn't be an issue if I could write custom operators 
for the components that basically did that. But I can't without 
wrapping float, which is why I am requesting these magic types 
get some basic features like that.

I'm wondering if I should be looking at just using inlined ASM 
and use the ASM SIMD instructions directly. I know basic ASM, but 
I don't know the potential pitfalls of doing that, 
especially with portability. Is there a reason not to do this 
(short of complexity)? I'm also wondering why wrapping a 
core.simd type into a struct completely negates performance. I'm 
guessing it's because when I return the struct type, the compiler 
has to think about it as a struct instead of its "magic" type, and 
all struct types have a bit more overhead.
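
For reference, a minimal sketch of the kind of wrapper meant by "a 
struct with an internal float4". The name Vec4 and its layout are 
assumptions for illustration, not my actual library code. It only 
forwards += and -= to the wrapped vector; whether a given compiler 
inlines the operator and keeps the value in an SSE register is the 
open question, and this sketch doesn't answer it.

     import core.simd;

     // Hypothetical wrapper around the "magic" vector type.
     struct Vec4
     {
         float4 v;

         // Forward += and -= straight to the underlying float4.
         ref Vec4 opOpAssign(string op)(Vec4 rhs)
             if (op == "+" || op == "-")
         {
             mixin("v " ~ op ~ "= rhs.v;");
             return this;
         }
     }

     void main()
     {
         float4 av = [1, 2, 3, 4];
         float4 bv = [1, 4, -12, 5];
         auto a = Vec4(av);
         auto b = Vec4(bv);

         a += b; // lane-wise add through the wrapper
         b -= a; // lane-wise subtract through the wrapper

         assert(a.v.array == [2f, 6, -9, 9]);
         assert(b.v.array == [-1f, -2, -3, -4]);
     }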

On a side note, DMD without SIMD is much faster than C# without 
SIMD, usually by a factor of 8x on simple vector types 
(micro-benchmarks), and that's not counting the runtimes' startup 
times either. However, when I use Mono.Simd, both DMD (with 
core.simd) and C# have similar performance (see below). Math code 
with Mono C# (with SIMD) actually runs faster on Linux (even 
without the SGen GC or LLVM codegen) than it does on Windows 8 
with MS .NET, which I find to be pretty impressive and 
encouraging for our future games with Mono on Android (which has 
been our biggest performance PITA platform so far).

I've noticed some really odd things with core.simd as well, which 
is another reason I'm thinking of trying inlined ASM. I'm not sure 
what's causing certain compiler optimizations. For instance, 
given the basic test program, when I do:

     float rand = ...; // user input value

     float4 a, b = [1, 4, -12, 5];

     a.ptr[0] = rand;
     a.ptr[1] = rand + 1;
     a.ptr[2] = rand + 2;
     a.ptr[3] = rand + 3;

     ulong mil;
     StopWatch sw;

     foreach (t; 0 .. testCount)
     {
         sw.start();
         foreach (i; 0 .. 1_000_000)
         {
             a += b;
             b -= a;
         }
         sw.stop();
         mil += sw.peek().msecs;
         sw.reset();
     }

     writeln(a.array, ", ", b.array);
     writeln(cast(double) mil / testCount);

When I run this on my Phenom II X4 920, it completes in ~9ms. For 
comparison, C# Mono.Simd gets almost identical performance with 
identical code. However, if I add:

     auto vec4(float x, float y, float z, float w)
     {
         float4 result;

         result.ptr[0] = x;
         result.ptr[1] = y;
         result.ptr[2] = z;
         result.ptr[3] = w;

         return result;
     }

then replace the vector initialization lines:

     float4 a, b = [ ... ];
     a.ptr[0] = rand;
     ...

with ones using my factory function:

     auto a = vec4(rand, rand+1, rand+2, rand+3);
     auto b = vec4(1, 4, -12, 5);

Then the program consistently completes in 2.15ms...

wtf right? The printed vector output is identical, and there are no 
changes to the loop code (a += b, etc), I just change the 
construction code of the vectors and it runs 4.5x faster. Beats 
me, but I'll take it. Btw, for comparison, if I use a struct with 
an internal float4 it runs in ~19ms, and a struct with four 
floats runs in ~22ms. So you can see my concerns with using 
core.simd types directly, especially when my Intel Mac gets even 
better improvements with SIMD code.
I haven't done extensive tests on the Intel, but in my original test 
(the one above, only in C# using Mono.Simd) the results were ~55ms 
using a struct with an internal float4, and ~5ms using float4 
directly.

anyways, thanks for the feedback.