SIMD support...

Martin Nowak dawg at
Fri Jan 6 05:04:38 PST 2012

On Fri, 06 Jan 2012 13:56:58 +0100, Martin Nowak <dawg at> wrote:

> On Fri, 06 Jan 2012 09:43:30 +0100, Walter Bright  
> <newshound2 at> wrote:
>> On 1/5/2012 5:42 PM, Manu wrote:
>>> So I've been hassling about this for a while now, and Walter asked me  
>>> to pitch
>>> an email detailing a minimal implementation with some initial thoughts.
>> Takeaways:
>> 1. SIMD behavior is going to be very machine specific.
>> 2. Even trying to do something with + is fraught with peril, as integer  
>> adds with SIMD can be saturated or unsaturated.
>> 3. Trying to build all the details about how each of the various adds  
>> and other ops work into the compiler/optimizer is a large undertaking.  
>> D would have to support internally maybe 100 or more new operators.
>> So some simplification is in order, perhaps a low level layer that is  
>> fairly extensible for new instructions, and for which a library can be  
>> layered over for a more presentable interface. A half-formed idea of  
>> mine is, taking a cue from yours:
>> Declare one new basic type:
>>      __v128
>> which represents the 16 byte aligned 128 bit vector type. The only  
>> operations defined to work on it would be construction and assignment.  
>> The __ prefix signals that it is non-portable.
>> Then, have:
>>     import core.simd;
>> which provides two functions:
>>     __v128 simdop(operator, __v128 op1);
>>     __v128 simdop(operator, __v128 op1, __v128 op2);
>> This will be a function built in to the compiler, at least for the x86.  
>> (Other architectures can provide an implementation of it that simulates  
>> its operation, but I doubt that it would be worth anyone's while to use  
>> that.)
>> The operators would be an enum listing of the SIMD opcodes,
>>      PFACC, PFADD, PFCMPEQ, etc.
>> For:
>>      z = simdop(PFADD, x, y);
>> the compiler would generate:
>>      MOV z,x
>>      PFADD z,y
>> The code generator knows enough about these instructions to do register  
>> assignments reasonably optimally.
>> What do you think? It ain't beeyoootiful, but it's implementable in a
>> reasonable amount of time, and it should make it possible to write tight
>> & fast SIMD code without having to do it all in assembler.
>> One caveat is it is typeless; a __v128 could be used as 4 packed ints  
>> or 2 packed doubles. One problem with making it typed is it'll add 10  
>> more types to the base compiler, instead of one. Maybe we should just  
>> bite the bullet and do the types:
>>      __vdouble2
>>      __vfloat4
>>      __vlong2
>>      __vulong2
>>      __vint4
>>      __vuint4
>>      __vshort8
>>      __vushort8
>>      __vbyte16
>>      __vubyte16
> Those could be typedefs, i.e. alias this wrappers.
> Still, simdop would not be typesafe.
> As much as this proposal presents a viable solution,
> why not spend the time to extend inline asm instead?
> void foo()
> {
>      __v128 a = loadss(1.0f);
>      __v128 b = loadss(1.0f);
>      a = addss(a, b);
> }
> __v128 loadss(float v) // name matches the loadss calls in foo()
> {
>      __v128 res; // allocates register
>      asm
>      {
>          movss res, v[RBP];
>      }
>      return res; // returned in XMM0, but inlineable as a return assignment
> }
> __v128 addss(__v128 a, __v128 b) // passed in XMM0, XMM1 but inlineable
> {
>      __v128 res = a;
>      // asm prolog, allocates registers for every __v128 used within the asm
>      asm
>      {
>          addss res, b;
>      }
>      // asm epilog, possibly restore spilled registers
>      return res;
> }
> What would be needed?
>   - Implement the asm allocation logic.
>   - Functions containing asm statements should participate in inlining.
>   - Determining inline cost of asm statements.
> Used with typedefs such as __vubyte16, this would
> allow a really clean and simple library implementation of intrinsics.

Also, addss is a pure function, which could be important for optimizing
out certain calls. Maybe we should allow asm blocks to be attributed with pure.
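
To make the typed-wrapper and purity points concrete, here is a minimal,
portable sketch in plain D. It is not the proposed compiler feature: V128,
UByte16, splat, addWrap and addSat are hypothetical names, and the loops only
simulate what single instructions such as PADDB (wrapping) and PADDUSB
(unsigned saturating) would do. It shows an alias this wrapper over a typeless
128-bit value, the two different meanings of "add" on packed bytes, and
functions marked pure so the compiler is free to elide repeated calls.

```d
import std.stdio;

// Hypothetical stand-in for the proposed typeless __v128:
// 16 raw bytes with 16-byte alignment.
align(16) struct V128
{
    ubyte[16] bytes;
}

// Hypothetical typed wrapper. `alias raw this` lets it convert
// implicitly to the untyped V128 when needed.
struct UByte16
{
    V128 raw;
    alias raw this;
}

// Fill all 16 lanes with the same value.
UByte16 splat(ubyte v) pure nothrow @nogc @safe
{
    UByte16 r;
    r.raw.bytes[] = v;
    return r;
}

// Wrapping add per lane (simulates PADDB): results are taken modulo 256.
UByte16 addWrap(UByte16 a, UByte16 b) pure nothrow @nogc @safe
{
    UByte16 r;
    foreach (i; 0 .. 16)
        r.raw.bytes[i] = cast(ubyte)(a.raw.bytes[i] + b.raw.bytes[i]);
    return r;
}

// Saturating add per lane (simulates PADDUSB): results clamp at 255.
UByte16 addSat(UByte16 a, UByte16 b) pure nothrow @nogc @safe
{
    UByte16 r;
    foreach (i; 0 .. 16)
    {
        immutable int sum = a.raw.bytes[i] + b.raw.bytes[i];
        r.raw.bytes[i] = cast(ubyte)(sum > 255 ? 255 : sum);
    }
    return r;
}

void main()
{
    auto a = splat(200);
    auto b = splat(100);

    V128 untyped = a; // the typed wrapper decays to the untyped value

    writeln(addWrap(a, b).raw.bytes[0]); // 44  (300 modulo 256)
    writeln(addSat(a, b).raw.bytes[0]);  // 255 (clamped)
    writeln(untyped.bytes[0]);           // 200
}
```

Because the same operand types admit two different adds, overloading + on
such a wrapper would have to pick one of them silently, which is the peril
mentioned in the quoted takeaways. Named functions keep the choice explicit.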
