seeding the pot for 2.0 features [small vectors]

Bill Baxter dnewsgroup at billbaxter.com
Mon Jan 29 17:32:19 PST 2007


Chad J wrote:
> Bill Baxter wrote:
>> Mikola Lysenko wrote:
>>
>>> Bill Baxter wrote:
>>>
>>>> "Most CPUs today have *some* kind of SSE/Altivec type thing"
>>>>
>>>> That may be, but I've heard that at least SSE is really not that 
>>>> suited to many calculations -- especially ones in graphics.  
>>>> Something like you have to pack your data so that all the x 
>>>> components are together, and all y components together, and all z 
>>>> components together.  Rather than the way everyone normally stores 
>>>> these things as xyz, xyz.  Maybe Altivec, SSE2 and SSE3 fix that 
>>>> though.  At any rate I think maybe Intel's finally getting tired of 
>>>> being laughed at for their graphics performance so things are 
>>>> probably changing.
>>>>
>>>>
>>>
>>> I have never heard of any SIMD architecture where vectors work that 
>>> way.  On SSE, Altivec or MMX the components for the vectors are 
>>> always stored in contiguous memory.
>>
>>
>> Ok.  Well, I've never used any of these MMX/SSE/Altivec things myself, 
>> so it was just hearsay.  But the source was someone I know in the 
>> graphics group at Intel.  I must have just misunderstood his gripe, in 
>> that case.
>>
>>> In terms of graphics, this is pretty much optimal.  Most 
>>> manipulations on vectors like rotations, normalization, cross product 
>>> etc. require access to all components simultaneously.  I honestly 
>>> don't know why you would want to split each of them into separate 
>>> buffers...
>>>
>>> Surely it is simpler to do something like this:
>>>
>>> x y z w x y z w x y z w ...
>>>
>>> vs.
>>>
>>> x x x x ... y y y y ... z z z z ... w w w ...
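
[Interjecting in C terms just to pin the two layouts down -- a rough 
sketch, and the type names here are made up:

    /* interleaved, "array of structures": x y z w x y z w ... */
    struct Vertex { float x, y, z, w; };
    struct Vertex verts[1024];

    /* planar, "structure of arrays": x x x x ... y y y y ... */
    struct VertexPlanes {
        float x[1024], y[1024], z[1024], w[1024];
    };
]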
>>
>>
>>
>> Yep, I agree, but I thought that was exactly the gist of what this 
>> friend of mine was griping about.  As I understood it at the time, he 
>> was complaining that the CPU instructions are good at planar layout x 
>> x x x y y y y ... but not interleaved x y x y x y.
>>
>> If that's not the case, then great.
>>
>> --bb
> 
> Seems it's great.
> 
> It doesn't really matter what the underlying data is.  An SSE 
> instruction will add four 32-bit floats in parallel, never mind whether 
> the floats are x x x x or x y z w.  What meaning the floats have is up 
> to the programmer.
> 
> Of course, channelwise operations will be faster in planar (EX: add 24 
> to all red values, don't spend time on the other channels), while 
> pixelwise operations will be faster in interleaved (EX: alpha blending) 
> - these facts don't have much to do with SIMD.
> 
> Maybe the guy from Intel wanted to help planar pixelwise operations 
> (some mechanism to avoid having to dereference 3-4 different places at 
> once) or help interleaved channelwise operations (only operating on 
> every fourth float in an array without having to do 4 mov/adds to fill 
> a 128-bit register).
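
Right, and for what it's worth, here's roughly what that looks like 
with the standard SSE intrinsics from <xmmintrin.h> (just a sketch; the 
function names are mine).  The same addps does the work whether the 
four lanes are one interleaved xyzw vertex or four reds from a planar 
channel:

    #include <stddef.h>
    #include <xmmintrin.h>

    /* Pixelwise: add two interleaved xyzw vertices (AoS).  addps adds
       lane-for-lane; it doesn't care what the lanes mean. */
    void add_vertex(const float *a, const float *b, float *out)
    {
        _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }

    /* Channelwise: add 24 to a planar (SoA) red channel, four reds at
       a time, never touching the other channels. */
    void brighten_reds(float *red, size_t n)
    {
        __m128 k = _mm_set1_ps(24.0f);
        for (size_t i = 0; i + 4 <= n; i += 4)
            _mm_storeu_ps(red + i, _mm_add_ps(_mm_loadu_ps(red + i), k));
    }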


Sorry to keep harping on this, but here's an article that basically says 
exactly what my friend was saying.
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2350

From the article:
"""
The hadd and hsub instructions are horizontal additions and horizontal 
subtractions. These allow faster processing of data stored 
"horizontally" in (for example) vertex arrays. Here is a 4-element array 
of vertex structures.

     x1 y1 z1 w1 | x2 y2 z2 w2 | x3 y3 z3 w3 | x4 y4 z4 w4

SSE and SSE2 are organized such that performance is better when 
processing vertical data, or structures that contain arrays; for 
example, a vertex structure with 4-element arrays for each component:

     x1 x2 x3 x4
     y1 y2 y3 y4
     z1 z2 z3 z4
     w1 w2 w3 w4

Generally, the preferred organizational method for vertices is the 
former. Under SSE2, the compiler (or very unfortunate programmer) would 
have to reorganize the data during processing.
"""

The article is talking about how hadd and hsub in SSE3 help to correct 
the situation, but SSE3 isn't yet nearly as ubiquitous as SSE/SSE2, I 
would imagine.
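
For the record, hadd is what lets you do horizontal things like a dot 
product directly on the interleaved layout.  A rough sketch with the 
SSE3 intrinsic from <pmmintrin.h> (so it needs an SSE3-capable chip and 
compiler flag):

    #include <pmmintrin.h>   /* SSE3: _mm_hadd_ps */

    /* Dot product of two interleaved xyzw vertices, with no reshuffle
       into planar form. */
    float dot4(const float *a, const float *b)
    {
        __m128 p = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
        p = _mm_hadd_ps(p, p);    /* ax*bx+ay*by | az*bz+aw*bw | (repeated) */
        p = _mm_hadd_ps(p, p);    /* full sum broadcast to every lane */
        return _mm_cvtss_f32(p);  /* take the low lane */
    }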

--bb


