seeding the pot for 2.0 features [small vectors]

Mon Jan 29 17:22:53 PST 2007

Chad J wrote:
> Bill Baxter wrote:
>> Mikola Lysenko wrote:
>>
>>> Bill Baxter wrote:
>>>
>> Yep, I agree, but I thought that was exactly the gist of what this 
>> friend of mine was griping about.  As I understood it at the time, he 
>> was complaining that the CPU instructions are good at planar layout x 
>> x x x y y y y ... but not interleaved x y x y x y.
>>
>> If that's not the case, then great.
>>
>> --bb
> 
> Seems it's great.
> 
> It doesn't really matter what the underlying data is.  An SSE 
> instruction will add four 32-bit floats in parallel, never mind whether 
> the floats are x x x x or x y z w.  What meaning the floats have is up 
> to the programmer.
> 
> Of course, channelwise operations will be faster in planar (EX: add 24 
> to all red values, don't spend time on the other channels), while 
> pixelwise operations will be faster in interleaved (EX: alpha blending) 
> - these facts don't have much to do with SIMD.
> 
> Maybe the guy from Intel wanted to help planar pixelwise operations 
> (some mechanism to help the need to dereference 3-4 different places at 
> once) or help interleaved channelwise operations (only operate on every 
> fourth float in an array without having to do 4 mov/adds to fill a 128 
> bit register).

That could be.  I seem to remember now that the specific thing we were 
talking about was transforming a batch of vectors.  Is there a good way 
to do that with SSE?  I.e., for a 4x4 matrix with rows M1,M2,M3,M4 
you want to do something like:

   foreach(i,v; vector_batch)
      out[i] = [dot(M1,v),dot(M2,v),dot(M3,v),dot(M4,v)];
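
For what it's worth, one approach I believe is commonly used here avoids 
the four dot products entirely: store the matrix as its four columns and 
compute out = x*C0 + y*C1 + z*C2 + w*C3, which needs only vertical 
multiplies and adds.  Below is a minimal sketch in C with SSE1 intrinsics, 
not a tuned implementation.  The names Mat4Cols and transform_batch are 
made up for illustration, and I'm assuming the vectors are stored 
interleaved as x y z w.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical layout: the matrix is stored as its four COLUMNS, so that
   col[k][j] holds element k of row M(j+1) from the pseudocode above. */
typedef struct { float col[4][4]; } Mat4Cols;

/* Transform n interleaved xyzw vectors: out[i] = M * in[i]. */
void transform_batch(const Mat4Cols *m, const float *in, float *out, size_t n)
{
    /* Unaligned loads, so the sketch makes no alignment assumptions. */
    const __m128 c0 = _mm_loadu_ps(m->col[0]);
    const __m128 c1 = _mm_loadu_ps(m->col[1]);
    const __m128 c2 = _mm_loadu_ps(m->col[2]);
    const __m128 c3 = _mm_loadu_ps(m->col[3]);

    for (size_t i = 0; i < n; ++i) {
        const float *v = in + 4 * i;

        /* Broadcast each component across a register and scale the
           matching column; no horizontal sum is needed. */
        __m128 r = _mm_mul_ps(_mm_set1_ps(v[0]), c0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(v[1]), c1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(v[2]), c2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(v[3]), c3));

        _mm_storeu_ps(out + 4 * i, r);
    }
}
```

Each output lane j ends up as M(j+1)[0]*x + M(j+1)[1]*y + M(j+1)[2]*z + 
M(j+1)[3]*w, i.e. the same value as the dot product in the loop above, 
just accumulated column by column.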

Maybe it had to do with not being able to operate 'horizontally'.  E.g. 
to do a dot product you can multiply x y z w times a b c d easily, but 
then you need the sum of those four products.  Apparently SSE3 has 
horizontal-add instructions (e.g. haddps) that help with this case:  you 
can add x+y and z+w in one step.
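
To make the 'horizontal' problem concrete, here is a rough sketch of a 
4-float dot product in C.  It deliberately sticks to SSE1 shuffles rather 
than the SSE3 horizontal add (haddps, exposed as _mm_hadd_ps), so that it 
compiles without extra flags.  The function name dot4 is just something I 
made up for the example.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Dot product of two 4-float vectors: one vertical multiply, followed by
   a horizontal sum built from shuffles and adds. */
float dot4(const float *a, const float *b)
{
    /* p = (a0*b0, a1*b1, a2*b2, a3*b3) */
    __m128 p = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));

    /* Swap neighbours within each pair: s = (p1, p0, p3, p2). */
    __m128 s = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1));

    /* sums = (p0+p1, p0+p1, p2+p3, p2+p3) */
    __m128 sums = _mm_add_ps(p, s);

    /* Move the upper pair of sums into the low lanes. */
    s = _mm_movehl_ps(s, sums);

    /* Lane 0 now holds p0+p1+p2+p3. */
    sums = _mm_add_ss(sums, s);
    return _mm_cvtss_f32(sums);
}
```

The multiply is a single instruction, but the reduction takes two shuffles 
and two adds, which is the extra work the horizontal-add instructions are 
meant to cut down.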

By the way, are there any good tutorials on programming with SIMD 
(specifically for Intel/AMD)?  Every time I've looked I've come up with 
pretty much nothing.  Googling for "SSE tutorial" doesn't turn up much.

As far as making use of SIMD goes (in C++), I ran across this project 
that looks very promising, but I have yet to give it a real try:
http://www.pixelglow.com/macstl/

--bb