core.simd woes

F i L witte2008 at gmail.com
Tue Aug 7 18:45:50 PDT 2012


Manu wrote:
> I'm not sure why the performance would suffer when placing it in a
> struct. I suspect it's because the struct causes the vectors to
> become unaligned, and that impacts performance a LOT. Walter has
> recently made some changes to expand the capability of align() to do
> most of the stuff you expect should be possible, including aligning
> structs, and propagating alignment from a struct member to its
> containing struct. This change might actually solve your problems...

I've tried all combinations with align() before and inside the 
struct, with no luck. I'm using DMD 2.060, so unless there's a 
new syntax I'm unaware of, I don't think it's been adjusted to 
fix any alignment issues with SIMD stuff. It would be great to be 
able to wrap float4 into a struct, but for now I've come up with 
an easy and understandable alternative using SIMD types directly.
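
For reference, this is roughly the kind of wrapper I was attempting 
(the name is just illustrative); neither placement of align() made a 
measurable difference on DMD 2.060:

import core.simd;

// Attempted float4 wrapper. I tried align(16) on the struct, on the
// member, and on both, with no luck.
align(16) struct Vec4
{
    align(16) float4 data;
}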


> Another suggestion I might make is to write DMD intrinsics that
> mirror the GDC code in std.simd and use that, then I'll sort out any
> performance problems as soon as I have all the tools I need to
> finish the module :)

Sounds like a good idea. I'll try to keep my code in line with yours 
to make transitioning to std.simd easier when it's complete.
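
As a first step, I'm imagining something along these lines (just a 
sketch; I'm assuming core.simd's __simd intrinsic and XMM opcode enum 
here, with the function name borrowed from std.simd):

import core.simd;

// A DMD-flavoured loadScalar: load one float into all four lanes.
// Lane 0 is written directly, then SHUFPS with imm8 = 0 replicates
// lane 0 everywhere -- assuming XMM.SHUFPS behaves like its SSE
// counterpart.
float4 loadScalar(float s)
{
    float4 v;
    v.ptr[0] = s;
    return cast(float4)__simd(XMM.SHUFPS, v, v, 0);
}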


> And this is precisely what I suggest you don't do. x64-SSE is the
> only architecture that can reasonably tolerate this (although it's
> still not the most efficient way). So if portability is important,
> you need to find another way.
>
> A 'proper' way to do this is something like:
>   float4 wideScalar = loadScalar(scalar); // this function loads a
> float into all 4 components. Note: this is a little slow, so factor
> these float->vector loads outside the hot loops as far as is
> practical.
>
>   float4 vecX = getX(vec); // we can make shorthand for this, like
> 'vec.xxxx' for instance...
>   vecX += wideScalar; // all 4 components maintain the same scalar
> value; this is so you can apply them back to non-scalar vectors
> later.
>
> With this, there are 2 typical uses. One is to scale another vector
> by your scalar, for instance:
>   someOtherVector *= vecX; // perform a scale of a full 4d vector
> by our 'wide' scalar
>
> The other, less common operation is that you may want to directly
> set the scalar to a component of another vector, setting Y to lock
> something to a height map for instance:
>   someOtherVector = setY(someOtherVector, wideScalar); // note: it
> is still important that you have a 'wide' scalar in this case for
> portability, since different architectures have very different
> interleave operations.
>
> Something like '.x' can never appear in efficient code. Sadly, most
> modern SIMD hardware is simply not able to efficiently express what
> you as a programmer intuitively want as convenient operations. Most
> SIMD hardware has absolutely no connection between the FPU and the
> SIMD unit, resulting in loads and stores to memory, and this in
> turn introduces another set of performance hazards.
> x64 is actually the only architecture that does allow interaction
> between the FPU and SIMD; however, it's still no less efficient to
> do it the way I describe, and as a bonus, your code will be
> portable.

Okay, that makes a lot of sense and is in line with what I was 
reading last night about FPU/SSE assembly code. However, I'm also a 
bit confused. At some point, like in your heightmap example, I'm 
going to need to do arithmetic work on single vector components. Is 
there some sort of SSE arithmetic/shuffle instruction which uses 
"masking" that I should use to isolate and manipulate components?
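
For instance, is a shuffle-based approach like this what you have in 
mind? This is just my guess, again assuming core.simd's __simd 
intrinsic, with the 'getY' name borrowed from your example:

import core.simd;

// Isolate a component with a shuffle instead of '.x'-style access:
// SHUFPS with imm8 = 0b01_01_01_01 replicates lane 1 (y) into all
// four lanes, producing a 'wide' y without going through the FPU.
float4 getY(float4 v)
{
    return cast(float4)__simd(XMM.SHUFPS, v, v, 0b01_01_01_01);
}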

If not, and manipulating single components is simply bad for 
performance, then I've figured out a solution to my original concern, 
using this code:

@property @trusted pure nothrow
{
   // component getters; returning by ref lets callers modify in place
   auto ref x(T:float4)(auto ref T v) { return v.ptr[0]; }
   auto ref y(T:float4)(auto ref T v) { return v.ptr[1]; }
   auto ref z(T:float4)(auto ref T v) { return v.ptr[2]; }
   auto ref w(T:float4)(auto ref T v) { return v.ptr[3]; }

   // component setters; T has to appear in the parameter list so the
   // template argument can be deduced from the call site
   void x(T:float4)(ref T v, float val) { v.ptr[0] = val; }
   void y(T:float4)(ref T v, float val) { v.ptr[1] = val; }
   void z(T:float4)(ref T v, float val) { v.ptr[2] = val; }
   void w(T:float4)(ref T v, float val) { v.ptr[3] = val; }
}

I am able to perform arithmetic on single components:

     auto vec = Vectors.float4(x, y, 0, 1); // factory
     vec.x += scalar; // += components

Again, I'll abandon this approach if there's a better way to 
manipulate single components, like you mentioned above. I'm just not 
aware of how to do that using SSE instructions alone. I'll do more 
research, but would appreciate any insight you can give.


> Inline asm is usually less efficient for large blocks of code; it
> requires that you hand-tune the opcode sequencing, which is very
> hard to do, particularly for SSE.
> Small inline asm blocks are also usually less efficient, since most
> compilers can't rearrange other code within the function around the
> asm block, and this leads to poor opcode sequencing.
> I recommend avoiding inline asm where performance is desired unless
> you're confident in writing the ENTIRE function/loop in asm, and
> hand-tuning the opcode sequencing. But that's not portable...

Yes, after a bit of messing around with and researching ASM 
yesterday, I came to the conclusion that it's not a good fit for 
this. DMD can't inline functions containing ASM blocks right now 
anyway (although LDC can), which I'd imagine would kill any 
performance gains SSE brings.
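
To illustrate, even a trivial wrapper like this can't currently be 
inlined by DMD because of the asm block (a minimal sketch using 
DMD-style x86 inline asm):

// A tiny SSE scalar add via inline asm; the asm block alone is
// enough to prevent DMD from inlining the function.
float addScalar(float a, float b)
{
    asm
    {
        movss XMM0, a;  // load 'a' into an SSE register
        addss XMM0, b;  // a + b
        movss a, XMM0;  // store the result back into 'a'
    }
    return a;
}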

Plus, ASM is a pain in the ass. :-)


> Android? But you're benchmarking x64-SSE, right? I don't think it's
> reasonable to expect that performance characteristics for one
> architecture's SIMD hardware will be any indicator at all of how
> another architecture may perform.

I only meant that, since Mono C# is what we're using for our game 
code on any platform besides Windows/WP7/Xbox, and since Android has 
really been the only performance PITA for our Mono C# code, upgrading 
our vector libraries to use Mono.Simd should yield significant 
improvements there.

I'm just learning about SSE and proper vector utilization. In our 
last game we actually used Vector3s everywhere :-V which even we 
should have known not to do, since you have to convert them to 
float4s anyway to pass them into shader constants... I'm guessing 
this was our main performance issue on smartphones. Ahh, oh well.


> Also, if you're doing any of the stuff I've been warning against
> above, NEON will suffer very hard, whereas x64-SSE will mostly
> shrug it off.
>
> I'm very interested to hear your measurements when you try it out!

I'll let you know whether changing over to proper vector code makes a 
big difference.


> wtf indeed! O_o
>
> Can you paste the disassembly?

I'm not sure how to do that with DMD. I remember GDC has an 
output-to-asm flag, but not DMD. Or is there an external tool you use 
to look at .o/.obj files?


> I can tell you this though: as soon as DMD's SIMD support is able
> to do the missing stuff I need to complete std.simd, I shall do
> that, along with intensive benchmarks where I'll be scrutinising
> the code-gen very closely. I expect performance peculiarities like
> the ones you are seeing will be found and fixed at that time...

For now I've come to terms with using core.simd.float4 types 
directly, and have created acceptable solutions to my original 
problems. But I'm glad to hear that in the future I'll have more 
flexibility within my libraries.


