core.simd woes

Tue Oct 2 01:56:12 PDT 2012

On 8 August 2012 04:45, F i L <witte2008 at gmail.com> wrote:

> Manu wrote:
>
>> I'm not sure why the performance would suffer when placing it in a struct.
>> I suspect it's because the struct causes the vectors to become unaligned,
>> and that impacts performance a LOT. Walter has recently made some changes
>> to expand the capability of align() to do most of the stuff you expect
>> should be possible, including aligning structs, and propagating alignment
>> from a struct member to its containing struct. This change might actually
>> solve your problems...
>>
>
> I've tried all combinations with align() before and inside the struct,
> with no luck. I'm using DMD 2.060, so unless there's a new syntax I'm
> unaware of, I don't think it's been adjusted to fix any alignment issues
> with SIMD stuff. It would be great to be able to wrap float4 into a struct,
> but for now I've come up with an easy and understandable alternative using
> SIMD types directly.

I actually haven't had time to try out the new 2.060 alignment changes in
practice yet. As a Win64 D user, I'm stuck with compilers that are forever
3-6 months out of date (2.058). >_<
The use cases I needed in order to do this stuff efficiently were definitely
agreed to by Walter and, to my knowledge, implemented... so the problem
might be down to some other subtle detail.
It's possible that the intrinsic vector code-gen is hard-coded to use
unaligned loads too. You might need to assert appropriate alignment, and
then issue the movaps intrinsics directly, but I'm sure DMD can be fixed to
emit movaps when it detects the vector is aligned >= 16 bytes.
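
As a minimal sketch of the "assert appropriate alignment" step: the helper
isAligned16 and the Particle struct below are names made up for
illustration, and this assumes a compiler/target where core.simd's float4 is
available (x86_64). It only checks alignment; it does not show how to force
a particular load instruction.

```d
import core.simd;

// Hypothetical helper: true when p sits on a 16-byte boundary,
// which is what an aligned load such as movaps requires.
bool isAligned16(const(void)* p) pure nothrow @nogc
{
    return (cast(size_t) p & 15) == 0;
}

// Hypothetical wrapper struct; align(16) states the requirement explicitly.
align(16) struct Particle
{
    float4 position;
    float4 velocity;
}

void main()
{
    // The struct should inherit (at least) the alignment of its members.
    static assert(Particle.alignof >= 16);

    Particle p;
    assert(isAligned16(&p.position));
    assert(isAligned16(&p.velocity));
}
```

If either assert fires, the struct wrapper is a likely cause of the
slowdown; if both pass, the cause is more likely in the code-gen.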

[*clip* portability and cross-lane efficiency *clip*]
>>
>
> Okay, that makes a lot of sense and is in line with what I was reading last
> night about FPU/SSE assembly code. However I'm also a bit confused. At some
> point, like in your heightmap example, I'm going to need to do arithmetic
> work on single vector components. Is there some sort of SSE
> arithmetic/shuffle instruction which uses "masking" that I should use to
> isolate and manipulate components?
>

Well, actually, height maps are one thing that hardware SIMD units aren't
intrinsically good at, because they do specifically require component-wise
access.
That said, there are still lots of interesting possibilities.

If you're operating on a height map, for instance, rather than looping over
the position vectors, fetching y from it, doing something with y (which I
presume involves foreign data?), then putting it back, and repeating over
the next vector...

Do something like:

  align(16) float[numVerts] height_offsets; // one delta per vertex
  foreach(ref h; height_offsets)
  {
    // do some work to generate deltas for each vertex...
  }

Now what you can probably do is unpack them and apply them directly to the
vertex stream in a single pass:

  for(size_t i = 0; i < numVerts; i += 4)
  {
    float4 four_heights = loadaps(&height_offsets[i]);

    float4[4] heights;
    // do some shuffling/unpacking to result:
    //   four_heights.xyzw ->
    //     heights[0].y, heights[1].y, heights[2].y, heights[3].y
    // (fiddly, but simple enough)

    vertices[i + 0] += heights[0];
    vertices[i + 1] += heights[1];
    vertices[i + 2] += heights[2];
    vertices[i + 3] += heights[3];
  }

... I'm sure that could be improved, but you get the idea..? This approach
should pipeline well, have reasonably low bandwidth, make good use of
registers, and you can see there is no interaction between the FPU and SIMD
unit.

Disclaimer: I just made that up off the top of my head ;), but that
illustrates the kind of approach you would usually take in efficient (and
portable) SIMD code.
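
For reference, a compilable sketch of the data layout described above,
assuming core.simd's float4 is available (x86_64). The function name
applyHeights is made up for illustration. The lane placement is done here
with a plain scalar write through .array as a stand-in; a tuned version
would replace that inner step with unpack/shuffle instructions so the
values never leave the SIMD registers.

```d
import core.simd;

// Adds deltas[i] to the y lane of vertices[i], four vertices per outer step.
// Assumes both slices have the same length and that it is a multiple of 4.
void applyHeights(float4[] vertices, const(float)[] deltas)
{
    assert(vertices.length == deltas.length && deltas.length % 4 == 0);

    for (size_t i = 0; i < deltas.length; i += 4)
    {
        foreach (k; 0 .. 4)
        {
            float4 offset = 0.0f;            // (0, 0, 0, 0)
            offset.array[1] = deltas[i + k]; // (0, h, 0, 0): placeholder for a shuffle
            vertices[i + k] += offset;       // one vector add per vertex
        }
    }
}

void main()
{
    float4[4] verts;
    foreach (ref v; verts)
        v = 1.0f;                            // every lane starts at 1

    float[4] deltas = [0.5f, 1.0f, 1.5f, 2.0f];
    applyHeights(verts[], deltas[]);

    // Only the y lane of each vertex has changed.
    assert(verts[2].array == [1.0f, 2.5f, 1.0f, 1.0f]);
}
```

The point of the layout is that the per-vertex work that generates the
deltas stays entirely scalar, and the vertex stream is only touched once,
with whole-vector adds.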

> If not, and manipulating components is just bad for performance reasons,
> then I've figured out a solution to my original concern. By using this code:
>
> @property @trusted pure nothrow
> {
>   auto ref x(T:float4)(auto ref T v) { return v.ptr[0]; }
>   auto ref y(T:float4)(auto ref T v) { return v.ptr[1]; }
>   auto ref z(T:float4)(auto ref T v) { return v.ptr[2]; }
>   auto ref w(T:float4)(auto ref T v) { return v.ptr[3]; }
>
>   void x(T:float4)(ref T v, float val) { v.ptr[0] = val; }
>   void y(T:float4)(ref T v, float val) { v.ptr[1] = val; }
>   void z(T:float4)(ref T v, float val) { v.ptr[2] = val; }
>   void w(T:float4)(ref T v, float val) { v.ptr[3] = val; }
> }
>

This is fine if your vectors are in memory to begin with. But if you're
already doing work on them, and they are in registers/local variables, this
is the worst thing you can do: taking .ptr typically forces the vector to be
written out to memory just so a single float can be read or written.
It's generally bad practice, and someone who isn't careful with their
usage will produce very slow code (on some platforms).