primitive vector types

Don nospam at nospam.com
Mon Feb 23 02:52:29 PST 2009


Bill Baxter wrote:
> On Mon, Feb 23, 2009 at 5:18 PM, Don <nospam at nospam.com> wrote:
>> Mattias Holm wrote:
>>> On 2009-02-21 17:03:06 +0100, Don <nospam at nospam.com> said:
>>>> I don't think that's messy at all. I can't see much difference between
>>>> special support for float[4] versus float4. It's better if the code can take
>>>> advantage of hardware without specific support. Bear in mind that SSE/SSE2
>>>> is a temporary situation. AVX provides for much longer arrays of vectors;
>>>> and it's extensible. You'd end up needing to keep adding on special types
>>>> whenever a new CPU comes out.
>>>>
>>>> Note that the fundamental concept which is missing from the C virtual
>>>> machine is that all modern machines can efficiently perform operations on
>>>> arrays of built-in types of length 2^n, for some small value of n.
>>>> We need to get this into the language abstraction. Not follow C++ in
>>>> hacking a few extra special types onto the old, deficient C model. And I
>>>> think D is actually in a position to do this.
>>>>
>>>> float[4] would be a greatly superior option if it could be done.
>>>> The key requirements are:
>>>> (1) need to specify that static arrays are passed by value.
>>>> (2) need to keep stack aligned to 16.
>>>> The good news is that both of these appear to be done on DMD2-Mac!
>>> Yes, float[4] would be ok, if some CPU independent permutation support can
>>> be added. Would this be with some intrinsic then or what? I very much like
>>> the OpenCL syntax for permutation, but I suppose that an intrinsic such as
>>> "float[4] noref permute(float[4] noref vec, int newPos0, int newPos1, int
>>> newPos2, int newPos3)" would work as well. Note that this should also work
>>> with double[2], byte[16], short[8] and int[4].
>> Note that if you had static arrays with value semantics, with proper
>> alignment, then you could simply create
>>
>> module std.swizzle;
>> float[4] permute(float[4] vec, int newPos0, int newPos1, int newPos2, int
>> newPos3);  /* intrinsic */
>>
>> float[4] wzyx(float[4] q) { return permute(q, 4, 3, 2, 1); }
>> float[4] xywz(float[4] q) { return permute(q, 1, 2, 4, 3); }
>> // etc
>>
>> ---
>> and your code would be:
>>
>> import std.swizzle;
>>
>> void main()
>> {
>>   float[4] t;
>>   auto u = t.wzyx;
>> }
>>
>> I don't think this is terribly difficult once the value semantics are in
>> place.
>> (Note that once you get beyond 4 members, the .xyzw syntax gives an
>> explosion of functions; but I think it's workable at 4; 4! is only 24.
>> Beyond that point, you'd probably require direct permute calls).
> 
> Actually it's 4^4 if you do it like OpenCL/GLSL/HLSL/Cg and allow
> repeats like .xxyy.

Yes. Is the syntax sugar actually needed for all the permutations?
Even so, it's still only 256, which is probably still OK. I don't think 
a language change is required.
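
As a rough illustration of why no language change seems necessary, here is a minimal sketch in which one library template covers all 256 names, repeats included. It assumes a compiler where static arrays are passed and returned by value. The names permute, componentIndex and swizzle are hypothetical; permute is a plain-D stand-in for the proposed intrinsic and takes 0-based indices, unlike the 1-based calls in the quoted example.

```d
// Hypothetical stand-in for the proposed intrinsic (0-based lane indices).
float[4] permute(float[4] v, int p0, int p1, int p2, int p3)
{
    float[4] r = [v[p0], v[p1], v[p2], v[p3]];
    return r;
}

// Map a component letter to its lane index; evaluated at compile time below.
int componentIndex(char c)
{
    switch (c)
    {
        case 'x': return 0;
        case 'y': return 1;
        case 'z': return 2;
        case 'w': return 3;
        default: assert(0, "component must be one of x, y, z, w");
    }
}

// One template handles every four-letter combination of x, y, z, w,
// so the 4^4 = 256 functions never have to be written out by hand.
float[4] swizzle(string s)(float[4] v)
    if (s.length == 4)
{
    enum i0 = componentIndex(s[0]);
    enum i1 = componentIndex(s[1]);
    enum i2 = componentIndex(s[2]);
    enum i3 = componentIndex(s[3]);
    return permute(v, i0, i1, i2, i3);
}

void main()
{
    float[4] t = [1.0f, 2.0f, 3.0f, 4.0f];
    assert(t.swizzle!"wzyx" == [4.0f, 3.0f, 2.0f, 1.0f]);
    assert(t.swizzle!"xxyy" == [1.0f, 1.0f, 2.0f, 2.0f]);
}
```

The call site is t.swizzle!"wzyx" rather than t.wzyx, so this trades a little of the sugar for not having to declare each name.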

This scheme doesn't cover:
* shufps/shufpd, where the two sources are different
* haddpd, haddps [SSE3] { double[2] a, b;  a[0]=a[0]+a[1]; a[1]=b[0]+b[1]; }
* non-temporal stores (although I think these are covered adequately by 
array operations)

and the byte/word operations:

* pack with saturation
* movmsk
* avg
* multiply and add.
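
To make the first two gaps concrete, here is a sketch of scalar reference versions of a two-source shuffle and a horizontal add, written as ordinary functions on value-type static arrays. The names shuffle2 and hadd are hypothetical and not part of any existing module; a real version would presumably be an intrinsic that the compiler maps onto shufps or haddpd.

```d
// Two-source shuffle in the style of shufps: the low two lanes are
// picked from a, the high two from b. Lane indices are 0-based.
float[4] shuffle2(float[4] a, float[4] b, int a0, int a1, int b0, int b1)
{
    float[4] r = [a[a0], a[a1], b[b0], b[b1]];
    return r;
}

// Horizontal add in the style of haddpd, following the pseudocode above:
// lane 0 sums the pair in a, lane 1 sums the pair in b.
double[2] hadd(double[2] a, double[2] b)
{
    double[2] r = [a[0] + a[1], b[0] + b[1]];
    return r;
}

void main()
{
    float[4] a = [1.0f, 2.0f, 3.0f, 4.0f];
    float[4] b = [5.0f, 6.0f, 7.0f, 8.0f];
    assert(shuffle2(a, b, 3, 2, 1, 0) == [4.0f, 3.0f, 6.0f, 5.0f]);

    double[2] c = [1.0, 2.0];
    double[2] d = [10.0, 20.0];
    assert(hadd(c, d) == [3.0, 30.0]);
}
```

Neither fits the single-source permute signature, which is why they would need their own entry points.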

So it looks to me as though, with minimal language changes, we could 
get almost complete SIMD support, with excellent syntax.



