SIMD support...

Martin Nowak dawg at dawgfoto.de
Thu Jan 5 20:28:36 PST 2012


On Fri, 06 Jan 2012 02:42:44 +0100, Manu <turkeyman at gmail.com> wrote:

> So I've been hassling about this for a while now, and Walter asked me to
> pitch an email detailing a minimal implementation with some initial
> thoughts.
>
> The first thing I'd like to say is that a lot of people seem to have the
> idea that float[4] should somehow be specialised as a candidate for SIMD
> optimisations. It's obviously been discussed, and this general opinion
> seems to be shared by a good few people here.
> I've had a whole bunch of rants about why I think this is wrong in other
> threads, so I won't repeat them here... instead, I'll attempt to detail an
> approach based on explicit vector types.
>
> So, what do we need...? A language-defined primitive vector type...
> that's all.
>
>
> -- What shall we call it? --
>
> Doesn't really matter... open to suggestions.
> VisualC calls it __m128, XBox360 calls it __vector4, and GCC calls it
> 'vector float' (a name I particularly hate: it doesn't specify any size,
> and it tries to associate the register with a specific type).
>
> I like v128, or something like that. I'll use that for the sake of this
> document. I think it is preferable to float4 for a few reasons:
>  * v128 says what the register intends to be: a general-purpose 128bit
> register that may be used for a variety of SIMD operations that aren't
> necessarily type-bound.
>  * float4 implies it is a specific 4 component float type, which is not
> what the raw type should be.
>  * If we use names like float4, it stands to reason that (u)int4,
> (u)short8, etc should also exist, and it also stands to reason that one
> might expect math operators and such to be defined...
>
> I suggest an initial language definition and implementation of something
> like v128; types like float4, (u)int4, etc, may then be implemented in the
> std library with complex behaviour like casting mechanics, and basic math
> operators...
>
>
> -- Alignment --
>
> This type needs to be 16byte aligned. Unaligned loads/stores are very
> expensive, and also tend to produce extremely costly LHS hazards on most
> architectures when accessing vectors in arrays. If they are not aligned,
> they are useless... honestly.
>
Actually, unaligned loads/stores are free if you have a recent Core i5.
But then my processor has AVX support, where loading/storing YMM registers
will benefit from 32-byte alignment.
This will always be too system-specific and volatile to justify a
specialized type.

I also don't think that we can efficiently provide arbitrary alignment
for stack variables; the performance penalty will kill your efforts.
GCC doesn't do it either.

A good alternative is to use a segmented stack
(https://github.com/dsimcha/TempAlloc)
and adjust alignment to your needs.

Providing intrinsics should happen through library support, either
through expression templates or, with GPGPU in mind, through a DSL
compiler for string mixins:

auto result = vectorize!q{
   auto v  = float4(a, b, c, d);
   auto v2 = float4(2 * a, 2.0, c - d, d + a);
   auto v3 = v * v2;
   auto v4 = __hadd(v3, v3);
   auto v5 = __hadd(v4, v4);
   return v5[0];
}(0.2, 0.2, 0.3, 0.4);

> ** Does this cause problems with class allocation? Are/can classes be
> allocated to an alignment as inherited from an aligned member? ... If not,
> this might be the bulk of the work.
>
> There is one other problem I know of that is only of concern on x86.
> In the C ABI, passing 16byte ALIGNED vectors by value is a problem,
> since x86 ALWAYS uses the stack to pass arguments, and has no way to align
> the stack.
> I wonder if D can get creative with its ABI here, passing vectors in
> registers, even though that's not conventional on x86... the C ABI was
> invented long before these hardware features.
> In lieu of that, x86 would (sadly) need to silently pass by const ref...
> and also do this in the case of register overflow.
>
> Every other architecture (including x64) is fine, since all other
> architectures pass in regs, and can align the stack as needed when
> overflowing the regs (since stack management is manual and not performed
> with special opcodes).
>
>
> -- What does this type do? --
>
> The primitive v128 type DOES nothing... it is a type that facilitates the
> compiler allocating SIMD registers, managing assignments, loads, and
> stores, and allows passing to/from functions BY VALUE in registers.
> Ie, the only valid operations would be:
>   v128 myVec = someStruct.vecMember; // and vice versa...
>   v128 result = someFunc(myVec); // and calling functions, passing by value.
>
> Nice bonus: This alone is enough to allow implementation of fast memcpy
> functions that copy 16 bytes at a time... ;)
>
>
> -- So, it does nothing... so what good is it? --
>
> Initially you could use this type in conjunction with inline asm, or
> architecture intrinsics to do useful stuff. This would be using the
> hardware totally raw, which is an important feature to have, but I imagine
> most of the good stuff would come from libraries built on top of this.
>
>
> -- Literal assignment --
>
> This is a hairy one. Endian issues appear in 2 layers here...
> Firstly, if you consider the vector to be 4 int's, the ints themselves may
> be little or big endian; in addition, the outer layer (ie. the order of
> x,y,z,w) may also be reversed on some architectures... This makes a
> single 128bit hex literal hard to apply.
> I'll have a dig and try to confirm this, but I have a suspicion that VMX
> defines its components in reverse to other architectures... (Note: this is
> not usually a problem in C, because vector code is sooo non-standard in C
> that it is ALWAYS ifdef-ed for each platform anyway, so the literal syntax
> and order can suit each platform.)
>
> For the primitive v128 type, I generally like the idea of using a huge
> 128bit hex literal.
>   v128 vec = 0x01234567_01234567_01234567_01234567; // yeah!! ;)
>
> Since the primitive v128 type is effectively typeless, it makes no sense
> to use syntax like this:
>   v128 myVec = { 1.0f, 2.0f, 3.0f, 4.0f }; // syntax like this should be
> reserved for use with a float4 type defined in a library somewhere.
>
> ... The problem is, this may not be linearly applicable to all hardware.
> If the order of the components matches the endian, then it is fine...
> I suspect VMX orders the components in reverse to match the fact that the
> values are big endian, which would be good, but I need to check. And if
> not... then literals may need to get a lot more complicated :)
>
> Assignment of literals to the primitive type IS actually important; it's
> common to generate bit masks in these registers, which are type-independent.
> I also guess libraries still need to leverage this primitive assignment
> functionality to assign their more complex literal expressions.
>
>
> -- Libraries --
>
> With this type, we can write some useful standard libraries. For a start,
> we can consider adding float4, int4, etc, and making them more
> intelligent... they would have basic maths operators defined, and probably
> implement type conversion when casting between types.
>
>   int4 intVec = floatVec; // perform a type conversion from float to int...
> or vice versa... (perhaps we make this require an explicit cast?)
>
>   v128 vec = floatVec; // implicit cast to the raw type always possible,
> and does no type casting, just a reinterpret
>   int4 intVec = vec; // conversely, the primitive type would implicitly
> assign to other types.
>   int4 intVec = cast(v128)floatVec; // piping through the primitive v128
> allows you to easily perform a reinterpret between vector types, rather
> than the usual type conversion.
>
> There are also a truckload of other operations that would be fleshed out.
> For instance, strongly typed literal assignment, and vector comparisons
> that can be used with if() (usually these allow you to test whether ALL
> components, or ANY component, meets a given condition). Conventional logic
> operators can't be neatly applied to vectors; you need to do something like
> this:
>   if(std.simd.allGreater(v1, v2) && std.simd.anyLessOrEqual(v1, v3)) ...
>
> We can discuss the libraries at a later date, but it's possible that you
> might also want to make some advanced functions in the library that are
> only supported on particular architectures (std.simd.sse...,
> std.simd.vmx..., etc.), which may be version()-ed.
>
>
> -- Exceptions, flags, and error conditions --
>
> SIMD units usually have their own control register for controlling various
> behaviours, most importantly NaN policy and exception semantics...
> I'm open to input here... what should be default behaviour?
> I'll bet the D community will opt for strict NaNs, and throw by default...
> but it is actually VERY common to disable hardware exceptions when working
> with SIMD code:
>   * often precision is less important than speed when using SIMD, and some
> SIMD units perform faster when these features are disabled.
>   * most SIMD algorithms (at least in performance-oriented code) are
> designed to tolerate '0,0,0,0' as the result of a divide by zero, or some
> other error condition.
>   * realtime physics tends to suffer error creep and freaky random
> explosions, and you can't have those crashing the program :) .. they're not
> really 'errors', they're expected behaviour, often producing 0,0,0,0 as a
> result, so they're easy to deal with.
>
> I presume it'll end up being NaNs and throw by default, but we do need some
> mechanism to change the SIMD unit flags for realtime use... A runtime
> function? Perhaps a compiler switch (C compilers do this sort of thing a
> lot)?
>
> It's also worth noting that there are numerous SIMD units out there that
> DON'T follow strict IEEE float rules, and don't support NaNs or hardware
> exceptions at all... others may simply set a divide-by-zero flag, but not
> actually trigger a hardware exception, requiring you to explicitly check
> the flag if you're interested.
> Will it be okay that the language's default behaviour of NaNs and throws
> is unsupported on such platforms? What are the implications of this?
>
>
> -- Future --
>
> AVX now exists; this is a 256bit SIMD architecture. We simply add a v256
> type, and everything else is precisely the same.
> I think this is perfectly reasonable... AVX is to SSE exactly as long is to
> int, or double is to float. They are different types with different
> register allocation and addressing semantics, and deserve a discrete type.
> As with v128, libraries may then be created to allow the types to interact.
>
> I know of 2 architectures that support 512bit (4x4 matrix) registers...
> same story; implement a primitive type, and then, using intrinsics, we can
> build interesting types in libraries.
>
> We may also consider a v64 type, which would map to the older MMX registers
> on x86... there are also other architectures with 64bit 'vector' registers
> (the Nintendo Wii, for one), supporting a pair of floats, or 4 shorts,
> etc...
> Same general concept, but only 64 bits wide.
>
>
> -- Conclusion --
>
> I think that's about it for a start. I don't think it's particularly a lot
> of work; the potential trouble points are 16byte alignment, literal
> expression, and potential issues relating to language guarantees about
> exception/error conditions...
> Go on, tear it apart!
>
> Discuss...


More information about the Digitalmars-d mailing list