SIMD support...
Martin Nowak
dawg at dawgfoto.de
Fri Jan 6 04:56:58 PST 2012
On Fri, 06 Jan 2012 09:43:30 +0100, Walter Bright
<newshound2 at digitalmars.com> wrote:
> On 1/5/2012 5:42 PM, Manu wrote:
>> So I've been hassling about this for a while now, and Walter asked me
>> to pitch
>> an email detailing a minimal implementation with some initial thoughts.
>
> Takeaways:
>
> 1. SIMD behavior is going to be very machine specific.
>
> 2. Even trying to do something with + is fraught with peril, as integer
> adds with SIMD can be saturated or unsaturated.
>
> 3. Trying to build all the details about how each of the various adds
> and other ops work into the compiler/optimizer is a large undertaking. D
> would have to support internally maybe 100 or more new operators.
>
> So some simplification is in order, perhaps a low level layer that is
> fairly extensible for new instructions, and for which a library can be
> layered over for a more presentable interface. A half-formed idea of
> mine is, taking a cue from yours:
>
> Declare one new basic type:
>
> __v128
>
> which represents the 16 byte aligned 128 bit vector type. The only
> operations defined to work on it would be construction and assignment.
> The __ prefix signals that it is non-portable.
>
> Then, have:
>
> import core.simd;
>
> which provides two functions:
>
> __v128 simdop(operator, __v128 op1);
> __v128 simdop(operator, __v128 op1, __v128 op2);
>
> This will be a function built in to the compiler, at least for the x86.
> (Other architectures can provide an implementation of it that simulates
> its operation, but I doubt that it would be worth anyone's while to use
> that.)
>
> The operators would be an enum listing of the SIMD opcodes,
>
> PFACC, PFADD, PFCMPEQ, etc.
>
> For:
>
> z = simdop(PFADD, x, y);
>
> the compiler would generate:
>
> MOV z,x
> PFADD z,y
>
> The code generator knows enough about these instructions to do register
> assignments reasonably optimally.
>
> What do you think? It ain't beeyoootiful, but it's implementable in a
> reasonable amount of time, and it should make it possible to write tight
> & fast SIMD code without having to do it all in assembler.
>
> One caveat is it is typeless; a __v128 could be used as 4 packed ints or
> 2 packed doubles. One problem with making it typed is it'll add 10 more
> types to the base compiler, instead of one. Maybe we should just bite
> the bullet and do the types:
>
> __vdouble2
> __vfloat4
> __vlong2
> __vulong2
> __vint4
> __vuint4
> __vshort8
> __vushort8
> __vbyte16
> __vubyte16
Those could be typedefs, i.e. alias this wrappers.
Still, simdop would not be typesafe.
As much as this proposal presents a viable solution,
why not spend the time extending inline asm instead?
void foo()
{
    __v128 a = loadss(1.0f);
    __v128 b = loadss(1.0f);
    a = addss(a, b);
}
__v128 loadss(float v)
{
    __v128 res; // allocates register
    asm
    {
        movss res, v[RBP];
    }
    return res; // return in XMM1 but inlineable return assignment
}
__v128 addss(__v128 a, __v128 b) // passed in XMM0, XMM1 but inlineable
{
    __v128 res = a;
    // asm prolog, allocates registers for every __v128 used within the asm
    asm
    {
        addss res, b;
    }
    // asm epilog, possibly restore spilled registers
    return res;
}
What would be needed?
- Implement the asm allocation logic.
- Functions containing asm statements should participate in inlining.
- Determining inline cost of asm statements.
When used with typedefs for __vubyte16 et al., this would
allow a really clean and simple library implementation of intrinsics.