SIMD support...
Martin Nowak
dawg at dawgfoto.de
Fri Jan 6 04:56:58 PST 2012
On Fri, 06 Jan 2012 09:43:30 +0100, Walter Bright
<newshound2 at digitalmars.com> wrote:
> On 1/5/2012 5:42 PM, Manu wrote:
>> So I've been hassling about this for a while now, and Walter asked me
>> to pitch
>> an email detailing a minimal implementation with some initial thoughts.
>
> Takeaways:
>
> 1. SIMD behavior is going to be very machine specific.
>
> 2. Even trying to do something with + is fraught with peril, as integer
> adds with SIMD can be saturated or unsaturated.
>
> 3. Trying to build all the details about how each of the various adds
> and other ops work into the compiler/optimizer is a large undertaking. D
> would have to support internally maybe 100 or more new operators.
>
> So some simplification is in order, perhaps a low level layer that is
> fairly extensible for new instructions, and for which a library can be
> layered over for a more presentable interface. A half-formed idea of
> mine is, taking a cue from yours:
>
> Declare one new basic type:
>
> __v128
>
> which represents the 16 byte aligned 128 bit vector type. The only
> operations defined to work on it would be construction and assignment.
> The __ prefix signals that it is non-portable.
>
> Then, have:
>
> import core.simd;
>
> which provides two functions:
>
> __v128 simdop(operator, __v128 op1);
> __v128 simdop(operator, __v128 op1, __v128 op2);
>
> This will be a function built in to the compiler, at least for the x86.
> (Other architectures can provide an implementation of it that simulates
> its operation, but I doubt that it would be worth anyone's while to use
> that.)
>
> The operators would be an enum listing of the SIMD opcodes,
>
> PFACC, PFADD, PFCMPEQ, etc.
>
> For:
>
> z = simdop(PFADD, x, y);
>
> the compiler would generate:
>
> MOV z,x
> PFADD z,y
>
> The code generator knows enough about these instructions to do register
> assignments reasonably optimally.
>
> What do you think? It ain't beeyoootiful, but it's implementable in a
> reasonable amount of time, and it should make it possible to write tight
> & fast SIMD code without having to do it all in assembler.
>
> One caveat is it is typeless; a __v128 could be used as 4 packed ints or
> 2 packed doubles. One problem with making it typed is it'll add 10 more
> types to the base compiler, instead of one. Maybe we should just bite
> the bullet and do the types:
>
> __vdouble2
> __vfloat4
> __vlong2
> __vulong2
> __vint4
> __vuint4
> __vshort8
> __vushort8
> __vbyte16
> __vubyte16
Those could be typedefs, i.e. alias this wrappers.
Still, simdop would not be typesafe.
As much as this proposal presents a viable solution,
why not spend the time extending inline asm instead?
void foo()
{
    __v128 a = loadss(1.0f);
    __v128 b = loadss(1.0f);
    a = addss(a, b);
}
__v128 loadss(float v)
{
    __v128 res; // allocates register
    asm
    {
        movss res, v[RBP];
    }
    return res; // return in XMM1 but inlineable return assignment
}
__v128 addss(__v128 a, __v128 b) // passed in XMM0, XMM1 but inlineable
{
    __v128 res = a;
    // asm prolog, allocates registers for every __v128 used within the asm
    asm
    {
        addss res, b;
    }
    // asm epilog, possibly restore spilled registers
    return res;
}
What would be needed?
- Implement the asm allocation logic.
- Functions containing asm statements should participate in inlining.
- Determining inline cost of asm statements.
When used with typedefs for __vubyte16 et al., this would
allow a really clean and simple library implementation of intrinsics.