SIMD support...

Thu Jan 5 17:42:44 PST 2012

So I've been hassling about this for a while now, and Walter asked me to
pitch an email detailing a minimal implementation with some initial
thoughts.

The first thing I'd like to say is that a lot of people seem to have this
idea that float[4] should be specialised as a candidate for simd
optimisations somehow. It's obviously been discussed, and this general
opinion seems to be shared by a good few people here.
I've had a whole bunch of rants about why I think this is wrong in other
threads, so I won't repeat them here. That said, I'll attempt to detail an
approach based on explicit vector types.

So, what do we need...? A language defined primitive vector type... that's
all.

-- What shall we call it? --

Doesn't really matter... open to suggestions.
Visual C++ calls it __m128, Xbox 360 calls it __vector4, GCC calls it
'vector float' (a name I particularly hate, since it doesn't specify any
size and tries to associate the register with a specific type).

I like v128, or something like that. I'll use that for the sake of this
document. I think it is preferable to float4 for a few reasons:
 * v128 says what the register intends to be, a general purpose 128bit
register that may be used for a variety of simd operations that aren't
necessarily type bound.
 * float4 implies it is a specific 4 component float type, which is not
what the raw type should be.
 * If we use names like float4, it stands to reason that (u)int4,
(u)short8, etc should also exist, and it also stands to reason that one
might expect math operators and such to be defined...

I suggest initial language definition and implementation of something like
v128, and then types like float4, (u)int4, etc, may be implemented in the
std library with complex behaviour like casting mechanics, and basic math
operators...

-- Alignment --

This type needs to be 16-byte aligned. Unaligned loads/stores are very
expensive, and also tend to produce extremely costly load-hit-store (LHS)
hazards on most architectures when accessing vectors in arrays. If they are
not aligned, they are useless... honestly.

** Does this cause problems with class allocation? Can class instances be
allocated at an alignment inherited from an aligned member? ... If not,
this might be the bulk of the work.
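
To make the allocation question concrete, here is a rough C11 sketch of the behaviour being asked for. The `v128` struct is only a hypothetical stand-in for the proposed primitive (16 opaque bytes), and `particle`, `particle_array_new` and `is_aligned16` are names made up for illustration. It shows an aggregate inheriting the 16-byte requirement from its member, and a heap allocation that has to honour it.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the proposed primitive: 16 opaque bytes,
   16-byte aligned. */
typedef struct {
    alignas(16) unsigned char bytes[16];
} v128;

/* An aggregate with a v128 member inherits its alignment requirement. */
typedef struct {
    int  id;
    v128 position;
} particle;

static_assert(alignof(v128) == 16, "v128 must be 16-byte aligned");
static_assert(alignof(particle) == 16, "containers inherit the alignment");

/* The heap allocation has to honour that alignment too. aligned_alloc
   (C11) wants a size that is a multiple of the alignment, which
   sizeof(particle) already is because of trailing padding. */
static particle *particle_array_new(size_t count)
{
    return aligned_alloc(alignof(particle), count * sizeof(particle));
}

/* Returns nonzero if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}
```

A general-purpose allocator that only guarantees 8-byte alignment would break the second half of this, which is the part a class allocator would need to handle.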

There is one other problem I know of that is only of concern on 32-bit x86.
In the C ABI, passing 16-byte ALIGNED vectors by value is a problem, since
x86 passes arguments on the stack by default, and the calling convention
doesn't guarantee that the stack is 16-byte aligned.
I wonder if D can get creative with its ABI here, passing vectors in
registers, even though that's not conventional on x86... the C ABI was
invented long before these hardware features.
In lieu of that, x86 would (sadly) need to silently pass by const ref...
and also do this in the case of register overflow.

Every other architecture (including x64) is fine, since all other
architectures pass in regs, and can align the stack as needed when
overflowing the regs (since stack management is manual and not performed
with special opcodes).

-- What does this type do? --

The primitive v128 type DOES nothing... it is a type that facilitates the
compiler allocating SIMD registers, managing assignments, loads, and
stores, and allows passing to/from functions BY VALUE in registers.
Ie, the only valid operations would be:
  v128 myVec = someStruct.vecMember; // and vice versa...
  v128 result = someFunc(myVec); // and calling functions, passing by value.

Nice bonus: This alone is enough to allow implementation of fast memcpy
functions that copy 16 bytes at a time... ;)
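
As a minimal sketch of that memcpy idea in C11: `v128` is again a hypothetical 16-byte stand-in for the proposed type and `copy_blocks16` is an illustrative name. It assumes both buffers are 16-byte aligned, do not overlap, and are a whole number of blocks long.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical stand-in for the proposed primitive. */
typedef struct {
    alignas(16) unsigned char bytes[16];
} v128;

static_assert(sizeof(v128) == 16, "one block is exactly 16 bytes");

/* Copies `count` 16-byte blocks from src to dst. Both pointers must be
   16-byte aligned and the two regions must not overlap. */
static void copy_blocks16(v128 *restrict dst, const v128 *restrict src,
                          size_t count)
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = src[i];  /* plain assignment moves all 16 bytes; the
                             compiler is free to use one SIMD load and
                             one SIMD store for it */
}
```

A real memcpy would still need scalar code for unaligned heads and tails; this only covers the aligned middle.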

-- So, it does nothing... so what good is it? --

Initially you could use this type in conjunction with inline asm, or
architecture intrinsics to do useful stuff. This would be using the
hardware totally raw, which is an important feature to have, but I imagine
most of the good stuff would come from libraries built on top of this.
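
For a feel of what "totally raw" looks like today, here is a small C sketch that uses the SSE intrinsics from `<xmmintrin.h>` when the compiler defines `__SSE__`, and falls back to a scalar loop elsewhere so it still builds on other architectures. The function name `add4` is made up for illustration; the intrinsics themselves are the standard SSE ones.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

static_assert(sizeof(float) == 4, "four floats fill one 128-bit register");

/* out[i] = a[i] + b[i] for four floats. No alignment is assumed, so the
   SSE path uses the unaligned load/store intrinsics. */
static void add4(const float *a, const float *b, float *out)
{
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
#else
    for (size_t i = 0; i < 4; ++i)  /* portable scalar fallback */
        out[i] = a[i] + b[i];
#endif
}
```

The `#if` is exactly the per-platform ifdef-ing that a language-level primitive plus a library layer would hide.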

-- Literal assignment --

This is a hairy one. Endian issues appear in 2 layers here...
Firstly, if you consider the vector to be 4 ints, the ints themselves may
be little or big endian, but in addition, the outer layer (ie. the order of
x,y,z,w) may also be in reverse order on some architectures... This makes a
single 128bit hex literal hard to apply.
I'll have a dig and try and confirm this, but I have a suspicion that VMX
defines its components reverse to other architectures... (Note: not usually
a problem in C, because vector code is sooo non-standard in C that this is
ALWAYS ifdef-ed for each platform anyway, and the literal syntax and order
can suit)

For the primitive v128 type, I generally like the idea of using a huge
128bit hex literal.
  v128 vec = 0x01234567_01234567_01234567_01234567; // yeah!! ;)

Since the primitive v128 type is effectively typeless, it makes no sense to
use syntax like this:
  // syntax like this should be reserved for use with a float4 type
  // defined in a library somewhere.
  v128 myVec = { 1.0f, 2.0f, 3.0f, 4.0f };

... The problem is, this may not be linearly applicable to all hardware. If
the order of the components match the endian, then it is fine...
I suspect VMX orders the components reverse to match the fact the values
are big endian, which would be good, but I need to check. And if not...
then literals may need to get a lot more complicated :)
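
The inner layer of the problem (byte order within each component) is easy to demonstrate in plain C. In this sketch `v128` is a hypothetical 16-byte stand-in, and `v128_from_lanes` and `host_is_little_endian` are illustrative names. It fixes the outer order by definition (lane 0 at the lowest address) and lets the bytes inside each lane follow the host.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the proposed primitive. */
typedef struct {
    alignas(16) unsigned char bytes[16];
} v128;

static_assert(sizeof(v128) == 4 * sizeof(uint32_t), "four 32-bit lanes");

/* Builds a v128 from four 32-bit lanes, lane 0 at the lowest address.
   The bytes inside each lane are stored in the host's native order. */
static v128 v128_from_lanes(uint32_t x, uint32_t y, uint32_t z, uint32_t w)
{
    const uint32_t lanes[4] = { x, y, z, w };
    v128 v;
    memcpy(v.bytes, lanes, sizeof lanes);
    return v;
}

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
static int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

With `x = 0x01234567`, byte 0 of the result is `0x67` on a little-endian host and `0x01` on a big-endian one, so the same four lanes do not correspond to the same 128-bit hex string everywhere.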

Assignment of literals to the primitive type IS actually important: it's
common to generate bit masks in these registers, which are type-independent.
I also guess libraries still need to leverage this primitive assignment
functionality to assign their more complex literal expressions.

-- Libraries --

With this type, we can write some useful standard libraries. For a start,
we can consider adding float4, int4, etc, and make them more intelligent...
they would have basic maths operators defined, and probably implement type
conversion when casting between types.

  // performs a type conversion from float to int, or vice versa
  // (perhaps we make this require an explicit cast?)
  int4 intVec = floatVec;

  // implicit cast to the raw type is always possible, and does no type
  // conversion, just a reinterpret
  v128 vec = floatVec;

  // conversely, the primitive type would implicitly assign to other types
  int4 intVec = vec;

  // piping through the primitive v128 makes it easy to perform a
  // reinterpret between vector types, rather than the usual type
  // conversion
  int4 intVec = (v128)floatVec;
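
The difference between the two kinds of cast can be sketched with scalar C. Here `float4` and `int4` are plain 4-element structs standing in for the library types, and both function names are hypothetical. The sketch assumes IEEE 754 single-precision floats and leaves alignment out for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar stand-ins for the library vector types. */
typedef struct { float   lane[4]; } float4;
typedef struct { int32_t lane[4]; } int4;

static_assert(sizeof(float4) == sizeof(int4),
              "both views cover the same 128 bits");

/* Type conversion: each lane is converted numerically, truncating
   toward zero (the behaviour of a C float-to-int cast). */
static int4 float4_to_int4(float4 f)
{
    int4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = (int32_t)f.lane[i];
    return r;
}

/* Reinterpret: the 128 bits are copied unchanged and simply viewed as
   integers, which is what piping through the raw type would do. */
static int4 float4_bits_as_int4(float4 f)
{
    int4 r;
    memcpy(&r, &f, sizeof r);
    return r;
}
```

For a lane holding `1.0f`, the conversion gives `1` while the reinterpret gives `0x3F800000`, the float's bit pattern.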

There are also a truckload of other operations that would be fleshed out.
For instance, strongly typed literal assignment, vector comparisons that
can be used with if() (usually these allow you to test if ALL components,
or if ANY components meet a given condition). Conventional logic operators
can't be neatly applied to vectors. You need to do something like this:
  if(std.simd.allGreater(v1, v2) && std.simd.anyLessOrEqual(v1, v3)) ...
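
As a scalar reference for what those two predicates mean, here is a small C sketch. `float4` is a plain struct standing in for the library type, and `all_greater` / `any_less_or_equal` mirror the hypothetical `std.simd` names above. Real implementations would use a vector compare followed by a mask test rather than a loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Scalar stand-in for the library vector type. */
typedef struct { float lane[4]; } float4;

static_assert(sizeof(float4) == 4 * sizeof(float), "four packed lanes");

/* True only if EVERY lane of a is greater than the matching lane of b. */
static bool all_greater(float4 a, float4 b)
{
    for (int i = 0; i < 4; ++i)
        if (!(a.lane[i] > b.lane[i]))
            return false;
    return true;
}

/* True if AT LEAST ONE lane of a is less than or equal to the matching
   lane of b. */
static bool any_less_or_equal(float4 a, float4 b)
{
    for (int i = 0; i < 4; ++i)
        if (a.lane[i] <= b.lane[i])
            return true;
    return false;
}
```

Note that `all_greater(a, b)` being false does not imply `all_less_or_equal(a, b)` is true, which is why a single overloaded `>` can't express this neatly.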

We can discuss the libraries at a later date, but it's possible that you
might also want to make some advanced functions in the library that are
only supported on particular architectures, std.simd.sse...,
std.simd.vmx..., etc. which may be version()-ed.

-- Exceptions, flags, and error conditions --

SIMD units usually have their own control register for controlling various
behaviours, most importantly NaN policy and exception semantics...
I'm open to input here... what should be default behaviour?
I'll bet the D community opt for strict NaNs, and throw by default... but
it is actually VERY common to disable hardware exceptions when working with
SIMD code:
  * often precision is less important than speed when using SIMD, and some
SIMD units perform faster when these features are disabled.
  * most SIMD algorithms (at least in performance oriented code) are
designed to tolerate '0,0,0,0' as the result of a divide by zero, or some
other error condition.
  * realtime physics tends to suffer error creep and freaky random
explosions, and you can't have those crashing the program :) .. they're not
really 'errors', they're expected behaviour, often producing 0,0,0,0 as a
result, so they're easy to deal with.
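
To illustrate the "tolerate 0,0,0,0" style from the list above, here is a scalar C sketch of a divide that yields zero for any lane with a zero divisor instead of producing inf/NaN or trapping. `float4` is a stand-in struct and `div_or_zero` is a made-up name. Real SIMD code would usually do this branch-free with a compare mask and a bitwise AND.

```c
#include <assert.h>

/* Scalar stand-in for the library vector type. */
typedef struct { float lane[4]; } float4;

static_assert(sizeof(float4) == 4 * sizeof(float), "four packed lanes");

/* Lane-wise a / b, except that a lane whose divisor is exactly zero
   produces 0.0f, so no division by zero is ever executed. */
static float4 div_or_zero(float4 a, float4 b)
{
    float4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = (b.lane[i] != 0.0f) ? a.lane[i] / b.lane[i] : 0.0f;
    return r;
}
```

Code written this way never needs the hardware exception in the first place, which is part of why it is commonly switched off.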

I presume it'll end up being NaNs and throw by default, but we do need some
mechanism to change the SIMD unit flags for realtime use... A runtime
function? Perhaps a compiler switch (C does this sort of thing a lot)?

It's also worth noting that there are numerous SIMD units out there that
DON'T follow strict ieee float rules, and don't support NaNs or hardware
exceptions at all... others may simply set a divide-by-zero flag, but not
actually trigger a hardware exception, requiring you to explicitly check
the flag if you're interested.
Will it be okay that the language's default behaviour of NaNs and throws is
unsupported on such platforms? What are the implications of this?

-- Future --

AVX now exists; this is a 256-bit SIMD architecture. We simply add a v256
type, and everything else is precisely the same.
I think this is perfectly reasonable... AVX is to SSE exactly as long is to
int, or double is to float. They are different types with different
register allocation and addressing semantics, and deserve a discrete type.
As with v128, libraries may then be created to allow the types to interact.
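
A rough C11 sketch of what "distinct types plus a library to let them interact" could look like. Both structs are hypothetical stand-ins, `v256_half` is an illustrative name, and the 32-byte alignment on `v256` is my assumption (it matches what AVX aligned loads require), not something decided above.

```c
#include <assert.h>
#include <stdalign.h>
#include <string.h>

/* Hypothetical stand-ins for the two primitives. */
typedef struct { alignas(16) unsigned char bytes[16]; } v128;
typedef struct { alignas(32) unsigned char bytes[32]; } v256;

static_assert(sizeof(v256) == 2 * sizeof(v128), "v256 is two v128 wide");
static_assert(alignof(v256) == 32, "assumed 32-byte alignment for v256");

/* Library-level interaction: extracts one 128-bit half of a v256.
   half == 0 selects the bytes at the lower addresses, any other value
   selects the bytes at the higher addresses. */
static v128 v256_half(v256 v, int half)
{
    v128 r;
    memcpy(r.bytes, v.bytes + (half ? 16 : 0), sizeof r.bytes);
    return r;
}
```

Nothing here converts implicitly between the two widths, in the same way that long and int stay separate types.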

I know of 2 architectures that support 512bit (4x4 matrix) registers...
same story; implement a primitive type, then using intrinsics, we can build
interesting types in libraries.

We may also consider a v64 type, which would map to older MMX registers on
x86... there are also other architectures with 64bit 'vector' registers
(Nintendo Wii for one), supporting a pair of floats, or 4 shorts, etc...
Same general concept, but only 64 bits wide.

-- Conclusion --

I think that's about it for a start. I don't think it's a particularly
large amount of work; the potential trouble points are 16-byte alignment
and literal expression, plus possible issues relating to language
guarantees of exception/error conditions...
Go on, tear it apart!

Discuss...