SIMD support...

Manu turkeyman at gmail.com
Fri Jan 6 07:29:01 PST 2012


On 6 January 2012 16:12, bearophile <bearophileHUGS at lycos.com> wrote:

> > I see. While you design, you need to think about the other features of D
> :-) Is it possible to mix CPU SIMD with D vector ops?
> >
> > __float4[10] a, b, c;
> > c[] = a[] + b[];
>
> And generally, if the D compiler receives just D vector ops, what's a good
> way for the compiler to map them efficiently (even if less efficiently than
> true SIMD operations written manually) to SIMD ops? Generally you can't ask
> all D programmers to use __float4; some of them will want to use just D
> vector ops, even though they are less efficient, because they are simpler
> to use. So a good D compiler also has a duty to implement them efficiently
> enough.
>

 I'm not clear what you mean. Are you talking about D arrays of hardware
vectors, as in your example above? (No problem, see my last post.)
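
For reference, a rough sketch of that case, written against a float4
hardware type (as in core.simd, standing in for the proposed __float4; the
explicit loop stands in for the array-op syntax):

import core.simd;

void addArrays(ref float4[10] a, ref float4[10] b, ref float4[10] c)
{
    // one whole SIMD add per iteration; data never leaves the vector unit
    foreach (i; 0 .. c.length)
        c[i] = a[i] + b[i];
}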

Or are you talking about programmers who will prefer to use float[4]
instead of __float4? (This is what I think you're getting at.)
Users who prefer float[4] are welcome to do so, but I think you are
mistaken when you assume it will be 'simpler to use'. The rules for what
they can and can't do efficiently with a float[4] are extremely restrictive,
and it's also very unclear if/when they are violating those rules.
It will almost always be faster to let the float unit do all the work in
this case... Perhaps the compiler COULD apply some SIMD optimisations in
very specific cases, but that would require serious sophistication in the
compiler to detect.
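
To make the distinction concrete, a minimal sketch of the two paths (float4
as in core.simd, standing in for __float4; illustrative, not definitive):

import core.simd;

// plain D array op: the compiler MAY vectorise this, but nothing is promised
void scalarAdd(ref float[4] a, ref float[4] b, ref float[4] c)
{
    c[] = a[] + b[];
}

// explicit hardware vector: intended to compile to a single SIMD add
version (D_SIMD)
void simdAdd(ref float4 a, ref float4 b, ref float4 c)
{
    c = a + b;
}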

Some likely problems:
  * float[4] is not aligned; performing unaligned loads/stores will require
a long sequence of carefully pipelined vector code just to break even on
that cost (see the alignment sketch after this list). If the sequence of ops
is short, it will be faster to keep it in the FPU.
  * float[4] allows component-wise access. This forces transfers of data
between the FPU and the SIMD unit, which may again negate the advantage of
using SIMD opcodes over the FPU directly.
  * Loading a vectorised float[4] with floats calculated/stored on the FPU
produces the same hazards as above. If at all possible, SIMD regs should not
be loaded with data taken from the FPU.
  * How do you express logic and comparisons? Chances are people will write
arbitrary component-wise comparisons. This requires flushing the values out
of the SIMD regs back to the FPU for comparison, again negating any
advantage of SIMD calculation.
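
On the alignment point specifically, a minimal sketch (assuming an x86-style
target with 16-byte vectors; the exact numbers are illustrative):

version (D_SIMD)
{
    import core.simd;

    alias ScalarVec = float[4];
    static assert(ScalarVec.alignof == float.alignof); // typically 4; no SIMD guarantee
    static assert(float4.alignof == 16); // the vector type carries its own alignment
}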

The hazard I refer to almost universally is that of swapping data between
register types. This is a slow process, and it breaks any possibility of
efficient pipelining.
The FPU pipelines nicely:
  float[4] x; x[] += 1.0; // this will result in 4 sequential adds to
different registers; there are no data dependencies, so this will pipeline
beautifully, one cycle after another. This is probably only 3 cycles longer
than a SIMD add, plus a small cost for the extra opcodes in the instruction
stream.
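
Conceptually, that array op lowers to something like this (a sketch of the
lowering, not literal compiler output):

void bump(ref float[4] x)
{
    x[0] += 1.0f;
    x[1] += 1.0f;
    x[2] += 1.0f;
    x[3] += 1.0f; // four independent adds; the FPU issues them back to back
}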

Any time you need to swap register type, the pipeline is broken. Imagine
something seemingly harmless, and totally logical, like this:

float[4] hardwareVec; // compiler allows use of a hardware vector for
float[4]
hardwareVec[1] = groundHeight; // we want to set Y explicitly; seems
reasonable, perhaps we're snapping a position to a ground plane or something...

This may be achieved in some way that looks something like this:
 * groundHeight must be stored to the stack
 * flush pipeline (wait for the data to arrive) (potentially long time)
 * UNALIGNED load from stack into a vector register (this may require an
additional operation to rotate the vector into the proper position after
loading on some architectures)
 * flush pipeline (wait for data to arrive)
 * loaded float needs to be merged with the existing vector, this can be
done in a variety of ways
   - use a permute operation [only some architectures support arbitrary
permute, VMX is best] (one opcode, but requires pre-loading of a separate
permute control register to describe the appropriate merge, this load may
be expensive, and the data must be available)
   - use a series of shifts (2 shifts for X or W, 3 shifts for Y or Z); this
doesn't require any additional loads from memory, but each shift is a
dependent operation, and the pipeline must flush between them
   - use a mask and OR the 2 vectors together (since applying masks to both
the source and target vectors can be pipelined in parallel, and only the
final OR requires flushing the pipeline...)
   - [note: none of these options is ideal, and each may be preferable in
different situations depending on context; a sketch of the mask-and-OR
option follows below]
 * done
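
For illustration, here's a rough sketch of the mask-and-OR option, written
against a float4 hardware type (as in core.simd); the lane layout, names,
and operator support are assumptions, not a definitive implementation:

import core.simd;

float4 setY(float4 v, float4 y) // y holds the new Y value, at least in lane 1
{
    int4 keepMask = [-1,  0, -1, -1]; // all bits set in X, Z and W
    int4 takeMask = [ 0, -1,  0,  0]; // all bits set in Y only
    int4 vi = cast(int4) v;           // reinterpret the bits; no conversion
    int4 yi = cast(int4) y;
    // the two ANDs are independent and pipeline in parallel; only the
    // final OR has to wait on both, as described above
    return cast(float4)((vi & keepMask) | (yi & takeMask));
}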

Congratulations, you've now set the Y component... at the cost of an LHS
(load-hit-store) through memory, potentially other loads from memory, and
5-10 flushes of the pipeline, summing to hundreds, maybe thousands, of
wasted CPU cycles.
In the same amount of wasted time, you could have done a LOT of work with
the FPU directly.

The process for the same operation using just the FPU:
  * FPU stores groundHeight (already in an FPU reg) to &hardwareVec[1]
  * done

And if the value is an intermediate and never needs to be stored on the
stack, there's a chance the operation will be eliminated entirely, since
the value is already in a float reg, ready for use in the next operation :)
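
In code, the scalar path is just this (a sketch; groundHeight is the
hypothetical value from the example above):

void snapToGround(ref float[4] pos, float groundHeight)
{
    pos[1] = groundHeight; // one scalar store; no register-type swap
}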

I think the take-away I'm trying to illustrate here is:
SIMD work and scalar work do NOT mix... any syntax that allows it is a
mistake. Users won't understand all the details and implications of the
seemingly trivial operations they perform, and they shouldn't need to.
Auto-vectorisation of float[4] will require some amazingly sophisticated
code, and it will be very temperamental. If the compiler detects it can make
some optimisation, great, but it will not be reliable from a user's point of
view, and it won't be clear what to change to make the compiler do a better
job.
It also still implies policy problems, i.e. should float[4] be special-cased
to be aligned(16) when no other array requires this? What about all the
other types? How do you cast between them, and what are the expected
results?

I think it's best to forget about float[4] as a candidate for reliable
auto-vectorisation. Perhaps there's an opportunity for some nice little
compiler bonuses, but it should not be the language's window into efficient
use of the hardware.
Anyone using float[4] should accept that they are working with the FPU, and
they probably won't suffer much for it. If they want/need aggressive SIMD
optimisation, then they need to use the appropriate API, and understand, at
least a little bit, how the hardware works... Ideally the well-defined SIMD
API will make it easiest to do the right thing, and they won't need to know
all these hardware details to make good use of it.
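
To end on the positive case, a rough sketch of the style such an API should
encourage (float4 as in core.simd; names are illustrative): data enters the
vector unit, stays there for the whole computation, and only leaves at the
end.

import core.simd;

void integrate(ref float4 pos, ref float4 vel, in float4 accel, float dt)
{
    float4 dtv = dt;    // broadcast the scalar once, into a vector register
    vel += accel * dtv; // everything stays in SIMD registers
    pos += vel * dtv;   // no FPU round trips, no pipeline flushes
}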