core.simd woes

Manu turkeyman at gmail.com
Tue Aug 7 01:50:58 PDT 2012


On 7 August 2012 04:24, F i L <witte2008 at gmail.com> wrote:

> Right now I'm working with DMD on Linux x86_64. LDC doesn't support SIMD
> right now, and I haven't built GDC yet, so I can't do performance
> comparisons between the two. I really need to get around to setting up GDC,
> because I've always planned on using that as a "release compiler" for my
> code.
>
> The problem is, as I mentioned above, that performance of SIMD completely
> gets shot when wrapping a float4 into a struct, rather than using float4
> directly. There are some places (like matrices) where they do make a big
> impact, but I'm trying to find the best solution for general code. For
> instance my current math library looks like:
>
>     struct Vector4 { float x, y, z, w; ... }
>     struct Matrix4 { Vector4 x, y, z, w; ... }
>
> but I was planning on changing over to (something like):
>
>     alias float4 Vector4;
>     alias float4[4] Matrix4;
>
> So I could use the types directly and reap the performance gains. I'm
> currently doing this to both my D code (still in an early state) and our C#
> code for Mono. Both core.simd and Mono.Simd have "compiler magic" vector
> types, but Mono's version gives me access to component channels and simple
> constructors I can use, so for user code (and types like the Matrix above,
> with internal vectors) it's very convenient and natural. D's simply isn't,
> and I'm not sure there's any way around it since, again, at least with DMD,
> performance is shot when I put it in a struct.
>

I'm not sure why the performance would suffer when placing it in a struct.
I suspect it's because the struct causes the vectors to become unaligned,
and that impacts performance a LOT. Walter has recently made some changes
to expand the capability of align() to do most of the stuff you expect
should be possible, including aligning structs, and propagating alignment
from a struct member to its containing struct. This change might actually
solve your problems...
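
For example, a minimal sketch of what that should allow (Vector4 here is a
hypothetical wrapper, assuming the new align() behaviour lands as described):

  align(16) struct Vector4
  {
      float4 data;  // 16-byte member; with the propagation change, the
                    // containing struct should inherit that alignment
  }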

Another suggestion I might make is to write DMD intrinsics that mirror the
GDC code in std.simd and use that; then I'll sort out any performance
problems as soon as I have all the tools I need to finish the module :)
There's nothing inherent in the std.simd API that will produce
slower-than-optimal code when everything is working properly.
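
For instance, a sketch of what one such intrinsic-backed function might look
like (assuming DMD's __simd intrinsic and the XMM.ADDPS opcode from core.simd):

  import core.simd;

  // hypothetical DMD-side mirror of a std.simd-style add
  float4 add(float4 a, float4 b)
  {
      version (D_SIMD)  // defined when DMD's SIMD intrinsics are available
          return cast(float4) __simd(XMM.ADDPS, a, b);  // SSE addps
      else
          return a + b;  // fall back to the compiler's vector op
  }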


>> Better to factor your code to eliminate any scalar work, and make sure
>> 'scalars' are broadcast across all 4 components and continue doing 4d
>> operations.
>>
>> Instead of:
>>   @property pure nothrow float x(float4 v) { return v.ptr[0]; }
>>
>> Better to use:
>>   @property pure nothrow float4 x(float4 v) { return swizzle!"xxxx"(v); }
>>
>
> Thanks a lot for telling me this, I don't know much about SIMD stuff.
> You're actually the exact person I wanted to talk to, because you do know a
> lot about this and I've always respected your opinions.
>
> I'm not opposed to doing something like:
>
>     float4 addX(ref float4 v, float val)
>     {
>         float4 f = [0, 0, 0, 0];  // zero all four lanes first
>         f.ptr[0] = val;           // then set just the X lane
>         v += f;
>         return v;
>     }
>

> to do single component scalars, but it's very inconvenient for users to
> remember to use:
>
>     vec.addX(scalar);
>
> instead of:
>
>     vec.x += scalar;
>

And this is precisely what I suggest you don't do. x64-SSE is the only
architecture that can reasonably tolerate this (although it's still not the
most efficient way). So if portability is important, you need to find
another way.

A 'proper' way to do this is something like:

  // loadScalar loads a float into all 4 components. Note: this is a
  // little slow; factor these float->vector loads outside the hot loops
  // as far as is practical.
  float4 wideScalar = loadScalar(scalar);

  // we can make shorthand for this, like 'vec.xxxx' for instance...
  float4 vecX = getX(vec);

  // all 4 components maintain the same scalar value; this is so you can
  // apply them back to non-scalar vectors later:
  vecX += wideScalar;

With this, there are 2 typical uses. One is to scale another vector by your
scalar, for instance:

  // perform a scale of a full 4d vector by our 'wide' scalar
  someOtherVector *= vecX;

The other, less common operation is that you may want to directly set the
scalar into a component of another vector, setting Y to lock something to a
height map for instance:

  // note: it is still important that you have a 'wide' scalar in this case
  // for portability, since different architectures have very different
  // interleave operations.
  someOtherVector = setY(someOtherVector, wideScalar);
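
For reference, here's a rough sketch of those helpers in terms of core.simd
(the names match the API above, but these bodies are naive placeholders for
illustration, not the optimised per-architecture implementations):

  import core.simd;

  float4 loadScalar(float s)
  {
      float4 r;
      r.ptr[0] = s; r.ptr[1] = s; r.ptr[2] = s; r.ptr[3] = s;
      return r;  // all 4 lanes now hold s
  }

  float4 getX(float4 v)
  {
      return loadScalar(v.array[0]);  // naive 'vec.xxxx' broadcast
  }

  float4 setY(float4 v, float4 wide)
  {
      v.ptr[1] = wide.array[1];  // copy the wide scalar into lane Y
      return v;
  }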


Something like '.x' can never appear in efficient code.
Sadly, most modern SIMD hardware is simply not able to efficiently express
what you as a programmer intuitively want as convenient operations.
Most SIMD hardware has absolutely no connection between the FPU and the
SIMD unit, resulting in loads and stores to memory, and this in turn
introduces another set of performance hazards.
x64 is actually the only architecture that allows interaction between the
FPU and the SIMD unit; even there, doing it the way I describe is no less
efficient, and as a bonus, your code will be portable.
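
To make the hazard concrete (a hypothetical snippet; on most hardware the
first line compiles to a store, a scalar add, and a reload):

  // anti-pattern: per-lane scalar access round-trips through memory
  vec.ptr[0] += scalar;

  // wide alternative: stays in the SIMD unit. Note it touches ALL lanes,
  // so factor your data so that's what you want.
  vec += loadScalar(scalar);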

> But that wouldn't be an issue if I could write custom operators for the
> components that basically did that. But I can't without wrapping float,
> which is why I am requesting these magic types get some basic features like
> that.
>

See above.

> I'm wondering if I should be looking at just using inlined ASM and the
> ASM SIMD instructions directly. I know basic ASM, but I don't know what
> the potential pitfalls of doing that are, especially with portability. Is
> there a reason not to do this (short of complexity)? I'm also wondering
> why wrapping a core.simd type into a struct completely negates
> performance... I'm guessing it's because when I return the struct type,
> the compiler has to think about it as a struct, instead of its "magic"
> type, and all struct types have a bit more overhead.
>

Inline asm is usually less efficient for large blocks of code, since it
requires that you hand-tune the opcode sequencing, which is very hard to
do, particularly for SSE.
Small inline asm blocks are also usually less efficient, since most
compilers can't rearrange other code within the function around the asm
block, and this leads to poor opcode sequencing.
I recommend avoiding inline asm where performance is desired unless you're
confident writing the ENTIRE function/loop in asm and hand-tuning the
opcode sequencing. But that's not portable...
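
For illustration, this is the sort of shape I mean, with the whole operation
in one asm block (a hypothetical x86-64 DMD-style sketch; it assumes both
pointers are 16-byte aligned):

  // a += b for a single float4, entirely in inline asm
  void add4(float4* a, float4* b)
  {
      asm
      {
          mov RAX, a;          // load the pointer arguments
          mov RCX, b;
          movaps XMM0, [RAX];  // aligned 16-byte load
          addps XMM0, [RCX];   // packed single-precision add
          movaps [RAX], XMM0;  // store the result back through 'a'
      }
  }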


> On a side note, DMD without SIMD is much faster than C# without SIMD, by a
> factor of 8x usually on simple vector types (micro-benchmarks), and that's
> not counting the runtimes' startup times either. However, when I use
> Mono.Simd, both DMD (with core.simd) and C# are similar in performance (see
> below). Math code with Mono C# (with SIMD) actually runs faster on Linux
> (even without the SGen GC or LLVM codegen) than it does on Windows 8 with
> MS .NET, which I find to be pretty impressive and encouraging for our
> future games with Mono on Android (which has been our biggest performance
> PITA platform so far).
>

Android? But you're benchmarking x64-SSE, right? I don't think it's
reasonable to expect that performance characteristics for one architecture's
SIMD hardware will be any indicator at all of how another architecture may
perform.
Also, if you're doing any of the stuff I've been warning against above,
NEON will suffer very hard, whereas x64-SSE will mostly shrug it off.

I'm very interested to hear your measurements when you try it out!


> I've noticed some really odd things with core.simd as well, which is
> another reason I'm thinking of trying inlined ASM. I'm not sure what's
> causing certain compiler optimizations. For instance, given the basic test
> program, when I do:
>
>     float rand = ...; // user input value
>
>     float4 a, b = [1, 4, -12, 5];
>
>     a.ptr[0] = rand;
>     a.ptr[1] = rand + 1;
>     a.ptr[2] = rand + 2;
>     a.ptr[3] = rand + 3;
>
>     ulong mil;
>     StopWatch sw;
>
>     foreach (t; 0 .. testCount)
>     {
>         sw.start();
>         foreach (i; 0 .. 1_000_000)
>         {
>             a += b;
>             b -= a;
>         }
>         sw.stop();
>         mil += sw.peek().msecs;
>         sw.reset();
>     }
>
>     writeln(a.array, ", ", b.array);
>     writeln(cast(double) mil / testCount);
>
> When I run this on my Phenom II X4 920, it completes in ~9ms. For
> comparison, C# Mono.Simd gets almost identical performance with identical
> code. However, if I add:
>
>     auto vec4(float x, float y, float z, float w)
>     {
>         float4 result;
>
>         result.ptr[0] = x;
>         result.ptr[1] = y;
>         result.ptr[2] = z;
>         result.ptr[3] = w;
>
>         return result;
>     }
>
> then replace the vector initialization lines:
>
>     float4 a, b = [ ... ];
>     a.ptr[0] = rand;
>     ...
>
> with ones using my factory function:
>
>     auto a = vec4(rand, rand+1, rand+2, rand+3);
>     auto b = vec4(1, 4, -12, 5);
>
> Then the program consistently completes in 2.15ms...
>
> wtf right? The printed vector output is identical, and there are no changes
> to the loop code (a += b, etc); I just change the construction code of the
> vectors and it runs 4.5x faster. Beats me, but I'll take it. Btw, for
> comparison, if I use a struct with an internal float4 it runs in ~19ms, and
> a struct with four floats runs in ~22ms. So you can see my concerns with
> using core.simd types directly, especially when my Intel Mac gets even
> better improvements with SIMD code.
> I haven't done extensive tests on the Intel, but in my original test (the
> one above, only in C# using Mono.Simd) the results were ~55ms using a
> struct with an internal float4, and ~5ms using float4 directly.
>

wtf indeed! O_o

Can you paste the disassembly?
There should be no loads or stores in the loop, therefore it should be
unaffected... but it obviously is, so the only thing I can imagine that
could make a difference in the inner loop like that is a change in
alignment. Wrapping a vector in a struct will break the alignment, since,
until recently, DMD didn't propagate aligned members outwards to the
containing struct (which I think Walter fixed in 2.60??).
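
If you want to sanity-check the alignment theory before pulling up a
disassembler, something like this (a hypothetical one-liner, dropped into
your benchmark) will show whether the vectors land on 16-byte boundaries:

  writefln("a: %s, b: %s", (cast(size_t) &a) % 16, (cast(size_t) &b) % 16);
  // 0 means 16-byte aligned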

I can tell you this though: as soon as DMD's SIMD support is able to do the
missing stuff I need to complete std.simd, I shall do that, along with
intensive benchmarks where I'll be scrutinising the code-gen very closely.
I expect performance peculiarities like you are seeing will be found and
fixed at that time...