SIMD support...
Martin Nowak
dawg at dawgfoto.de
Fri Jan 6 05:04:38 PST 2012
On Fri, 06 Jan 2012 13:56:58 +0100, Martin Nowak <dawg at dawgfoto.de> wrote:
> On Fri, 06 Jan 2012 09:43:30 +0100, Walter Bright
> <newshound2 at digitalmars.com> wrote:
>
>> On 1/5/2012 5:42 PM, Manu wrote:
>>> So I've been hassling about this for a while now, and Walter asked me
>>> to pitch
>>> an email detailing a minimal implementation with some initial thoughts.
>>
>> Takeaways:
>>
>> 1. SIMD behavior is going to be very machine specific.
>>
>> 2. Even trying to do something with + is fraught with peril, as integer
>> adds with SIMD can be saturated or unsaturated.
>>
>> 3. Trying to build all the details about how each of the various adds
>> and other ops work into the compiler/optimizer is a large undertaking.
>> D would have to support internally maybe 100 or more new operators.
>>
>> So some simplification is in order, perhaps a low level layer that is
>> fairly extensible for new instructions, and for which a library can be
>> layered over for a more presentable interface. A half-formed idea of
>> mine is, taking a cue from yours:
>>
>> Declare one new basic type:
>>
>> __v128
>>
>> which represents the 16 byte aligned 128 bit vector type. The only
>> operations defined to work on it would be construction and assignment.
>> The __ prefix signals that it is non-portable.
>>
>> Then, have:
>>
>> import core.simd;
>>
>> which provides two functions:
>>
>> __v128 simdop(operator, __v128 op1);
>> __v128 simdop(operator, __v128 op1, __v128 op2);
>>
>> This will be a function built into the compiler, at least for the x86.
>> (Other architectures can provide an implementation of it that simulates
>> its operation, but I doubt that it would be worth anyone's while to use
>> that.)
>>
>> The operators would be an enum listing of the SIMD opcodes,
>>
>> PFACC, PFADD, PFCMPEQ, etc.
>>
>> For:
>>
>> z = simdop(PFADD, x, y);
>>
>> the compiler would generate:
>>
>> MOV z,x
>> PFADD z,y
>>
>> The code generator knows enough about these instructions to do register
>> assignments reasonably optimally.
>>
>> What do you think? It ain't beeyoootiful, but it's implementable in a
>> reasonable amount of time, and it should make it possible to write tight
>> & fast SIMD code without having to do it all in assembler.
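To illustrate the library layering mentioned above, a thin wrapper over
simdop might look roughly like this (float4 and its opBinary are only a
sketch, not part of the proposal):

// hypothetical library type layered over the proposed simdop builtin
struct float4
{
    __v128 data;

    // forwards + to the packed float add opcode from the enum above
    float4 opBinary(string op : "+")(float4 rhs)
    {
        return float4(simdop(PFADD, data, rhs.data));
    }
}

With that, z = x + y on float4 values would boil down to the MOV/PFADD
pair above once the wrapper is inlined.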
>>
>> One caveat is that it is typeless; a __v128 could be used as 4 packed ints
>> or 2 packed doubles. One problem with making it typed is it'll add 10
>> more types to the base compiler, instead of one. Maybe we should just
>> bite the bullet and do the types:
>>
>> __vdouble2
>> __vfloat4
>> __vlong2
>> __vulong2
>> __vint4
>> __vuint4
>> __vshort8
>> __vushort8
>> __vbyte16
>> __vubyte16
>
> Those could be typedefs, i.e. alias this wrappers.
> Still, simdop would not be typesafe.
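For illustration, such a wrapper could be as simple as the following
sketch (names made up); both types still convert to the raw __v128 that
simdop accepts, so mixing them up would not be caught:

// hypothetical alias-this wrappers giving __v128 a nominal element type
struct __vfloat4
{
    __v128 data;
    alias data this; // implicitly converts back to the raw vector type
}

struct __vint4
{
    __v128 data;
    alias data this;
}

// still compiles, since both wrappers decay to __v128:
// simdop(PFADD, someFloats, someInts);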
>
> As much as this proposal presents a viable solution,
> why not spend the time to extend inline asm instead?
>
> void foo()
> {
> __v128 a = loadss(1.0f);
> __v128 b = loadss(1.0f);
> a = addss(a, b);
> }
>
> __v128 loadss(float v)
> {
> __v128 res; // allocates register
> asm
> {
> movss res, v[RBP];
> }
> return res; // returned in XMM1, but the return assignment is inlineable
> }
>
> __v128 addss(__v128 a, __v128 b) // passed in XMM0, XMM1 but inlineable
> {
> __v128 res = a;
> // asm prolog, allocates registers for every __v128 used within the asm
> asm
> {
> addss res, b;
> }
> // asm epilog, possibly restore spilled registers
> return res;
> }
>
> What would be needed?
> - Implement the asm allocation logic.
> - Functions containing asm statements should participate in inlining.
> - Determine the inline cost of asm statements.
>
> When used with typedefs for __vubyte16 et al., this would
> allow a really clean and simple library implementation of intrinsics.
Also, addss is a pure function, which could be important for optimizing
out certain calls. Maybe we should allow asm blocks to be attributed with
pure.
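A minimal sketch of that, assuming asm blocks were allowed in pure
functions (the pure annotation is exactly the part that is not accepted
today):

// hypothetical: addss marked pure so redundant calls could be folded away
pure __v128 addss(__v128 a, __v128 b)
{
    __v128 res = a;
    asm
    {
        // the compiler would have to trust that this asm block has no
        // side effects beyond computing res
        addss res, b;
    }
    return res;
}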