seeding the pot for 2.0 features [small vectors]

Chad J gamerChad at _spamIsBad_gmail.com
Mon Jan 29 18:40:14 PST 2007


Bill Baxter wrote:
> Chad J wrote:
> 
>> Bill Baxter wrote:
>>
>>> Mikola Lysenko wrote:
>>>
>>>> Bill Baxter wrote:
>>>>
>>>>> "Most CPUs today have *some* kind of SSE/Altivec type thing"
>>>>>
>>>>> That may be, but I've heard that at least SSE is really not that 
>>>>> suited to many calculations -- especially ones in graphics.  
>>>>> Something like you have to pack your data so that all the x 
>>>>> components are together, and all y components together, and all z 
>>>>> components together.  Rather than the way everyone normally stores 
>>>>> these things as xyz, xyz.  Maybe Altivec, SSE2 and SSE3 fix that 
>>>>> though.  At any rate I think maybe Intel's finally getting tired of 
>>>>> being laughed at for their graphics performance so things are 
>>>>> probably changing.
>>>>>
>>>>>
>>>>
>>>> I have never heard of any SIMD architecture where vectors work that 
>>>> way.  On SSE, Altivec or MMX the components for the vectors are 
>>>> always stored in contiguous memory.
>>>
>>>
>>>
>>> Ok.  Well, I've never used any of these MMX/SSE/Altivec things 
>>> myself, so it was just hearsay.  But the source was someone I know in 
>>> the graphics group at Intel.  I must have just misunderstood his 
>>> gripe, in that case.
>>>
>>>> In terms of graphics, this is pretty much optimal.  Most 
>>>> manipulations on vectors like rotations, normalization, cross 
>>>> product etc. require access to all components simultaneously.  I 
>>>> honestly don't know why you would want to split each of them into 
>>>> separate buffers...
>>>>
>>>> Surely it is simpler to do something like this:
>>>>
>>>> x y z w x y z w x y z w ...
>>>>
>>>> vs.
>>>>
>>>> x x x x ... y y y y ... z z z z ... w w w ...
>>>
>>>
>>>
>>>
>>> Yep, I agree, but I thought that was exactly the gist of what this 
>>> friend of mine was griping about.  As I understood it at the time, he 
>>> was complaining that the CPU instructions are good at planar layout x 
>>> x x x y y y y ... but not interleaved x y x y x y.
>>>
>>> If that's not the case, then great.
>>>
>>> --bb
>>
>>
>> Seems it's great.
>>
>> It doesn't really matter what the underlying data is.  An SSE 
>> instruction will add four 32-bit floats in parallel, never mind whether 
>> the floats are x x x x or x y z w.  What meaning the floats have is up 
>> to the programmer.
>>
>> Of course, channelwise operations will be faster in planar (EX: add 24 
>> to all red values, don't spend time on the other channels), while 
>> pixelwise operations will be faster in interleaved (EX: alpha 
>> blending) - these facts don't have much to do with SIMD.
>>
>> Maybe the guy from Intel wanted to help planar pixelwise operations 
>> (some mechanism to help the need to dereference 3-4 different places 
>> at once) or help interleaved channelwise operations (only operate on 
>> every fourth float in an array without having to do 4 mov/adds to fill 
>> a 128 bit register).
> 
> 
> 
> Sorry to keep harping on this, but here's an article that basically says 
> exactly what my friend was saying.
> http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2350
> 
>  From the article:
> """
> The hadd and hsub instructions are horizontal additions and horizontal 
> subtractions. These allow faster processing of data stored 
> "horizontally" in (for example) vertex arrays. Here is a 4-element array 
> of vertex structures.
> 
>     x1 y1 z1 w1 | x2 y2 z2 w2 | x3 y3 z3 w3 | x4 y4 z4 w4
> 
> SSE and SSE2 are organized such that performance is better when 
> processing vertical data, or structures that contain arrays; for 
> example, a vertex structure with 4-element arrays for each component:
> 
>     x1 x2 x3 x4
>     y1 y2 y3 y4
>     z1 z2 z3 z4
>     w1 w2 w3 w4
> 
> Generally, the preferred organizational method for vertices is the 
> former. Under SSE2, the compiler (or very unfortunate programmer) would 
> have to reorganize the data during processing.
> """
> 
> The article is talking about how hadd and hsub in SSE3 help to correct 
> the situation, but SSE3 isn't yet nearly as ubiquitous as SSE/SSE2, I 
> would imagine.
> 
> --bb

That makes a lot of sense.

I remember running into trouble finding material on SSE as well.  I 
never really got past looking at what the instructions do, or maybe 
implementing an algorithm or two.  I would have needed the SSE2 
instructions for the integer work I wanted to do, and I don't think my 
old computer even supported them at the time :/ 
For my purposes, MMX was much easier to use and to find resources for.

You'll probably have better luck searching for "SSE Instruction Set" and 
just messing around with the instructions (that's probably what I'd do). 
There should also be some (probably meager) documentation and commentary 
on SSE from Intel.

Here are some pages I found:
http://softpixel.com/~cwright/programming/simd/sse.php
http://www.cpuid.com/sse.php
http://www.hayestechnologies.com/en/techsimd.htm



More information about the Digitalmars-d mailing list