seeding the pot for 2.0 features [small vectors]
Chad J
gamerChad at _spamIsBad_gmail.com
Mon Jan 29 16:49:07 PST 2007
Bill Baxter wrote:
> Mikola Lysenko wrote:
>
>> Bill Baxter wrote:
>>
>>> "Most CPUs today have *some* kind of SSE/Altivec type thing"
>>>
>>> That may be, but I've heard that at least SSE is really not that
>>> suited to many calculations -- especially ones in graphics.
>>> Something like you have to pack your data so that all the x
>>> components are together, and all y components together, and all z
>>> components together. Rather than the way everyone normally stores
>>> these things as xyz, xyz. Maybe Altivec, SSE2 and SSE3 fix that
>>> though. At any rate I think maybe Intel's finally getting tired of
>>> being laughed at for their graphics performance so things are
>>> probably changing.
>>>
>>>
>>
>> I have never heard of any SIMD architecture where vectors work that
>> way. On SSE, Altivec or MMX the components of a vector are always
>> stored in contiguous memory.
>
>
> Ok. Well, I've never used any of these MMX/SSE/Altivec things myself,
> so it was just hearsay. But the source was someone I know in the
> graphics group at Intel. I must have just misunderstood his gripe, in
> that case.
>
>> In terms of graphics, this is pretty much optimal. Most manipulations
>> on vectors like rotations, normalization, cross product etc. require
>> access to all components simultaneously. I honestly don't know why
>> you would want to split each of them into separate buffers...
>>
>> Surely it is simpler to do something like this:
>>
>> x y z w x y z w x y z w ...
>>
>> vs.
>>
>> x x x x ... y y y y ... z z z z ... w w w ...
>
>
>
> Yep, I agree, but I thought that was exactly the gist of what this
> friend of mine was griping about. As I understood it at the time, he
> was complaining that the CPU instructions are good at planar layout x x
> x x y y y y ... but not interleaved x y x y x y.
>
> If that's not the case, then great.
>
> --bb
Seems it's great.
It doesn't really matter what the underlying data is. An SSE
instruction will add four 32-bit floats in parallel, never mind whether
the floats are x x x x or x y z w. What meaning the floats have is up
to the programmer.
Of course, channelwise operations will be faster in planar layout (e.g.
adding 24 to all red values without spending time on the other
channels), while pixelwise operations will be faster in interleaved
(e.g. alpha blending) - but these facts don't have much to do with SIMD.
Maybe the guy from Intel wanted hardware support for planar pixelwise
operations (some mechanism for dereferencing 3-4 different places at
once) or for interleaved channelwise operations (operating on only
every fourth float in an array without needing 4 mov/adds to fill a
128-bit register).
More information about the Digitalmars-d mailing list