OOP, faster data layouts, compilers

Sean Cavanaugh WorksOnMyMachine at gmail.com
Fri Apr 22 18:12:08 PDT 2011


On 4/22/2011 4:41 PM, bearophile wrote:
> Sean Cavanaugh:
>
>> In many ways the biggest thing I use regularly in game development that
>> I would lose by moving to D would be good built-in SIMD support.  The PC
>> compilers from MS and Intel both have intrinsic data types and
>> instructions that cover all the operations from SSE1 up to AVX.  The
>> intrinsics are nice in that the job of register allocation and
>> scheduling is given to the compiler and generally the code it outputs is
>> good enough (though it needs to be watched at times).
>
> This is a topic quite different from the one I was talking about, but it's an interesting topic :-)
>
> SIMD intrinsics look ugly, they add a lot of noise to the code, and are very specific to one CPU or instruction set. You can't design a clean language with hundreds of those. Once 256 or 512 bit registers come, you need to add new intrinsics and change your code to use them. This is not so good.

In C++ the intrinsics are easily wrapped in __forceinline global 
functions, providing a platform abstraction over the raw intrinsics.
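
For example, a minimal sketch of such a wrapper layer (the 
VecF32/VecAdd/VecMul/VecSet names are made up for illustration, not any 
particular engine's API):

#include <xmmintrin.h>   // SSE1 intrinsics

typedef __m128 VecF32;

// Platform-neutral names over the raw intrinsics; the compiler still
// does register allocation and scheduling.
__forceinline VecF32 VecAdd(VecF32 a, VecF32 b) { return _mm_add_ps(a, b); }
__forceinline VecF32 VecMul(VecF32 a, VecF32 b) { return _mm_mul_ps(a, b); }
__forceinline VecF32 VecSet(float x, float y, float z, float w)
{
    return _mm_set_ps(w, z, y, x);   // _mm_set_ps takes arguments high-to-low
}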

Then you can write class wrappers to provide the most common level of 
functionality, which boils down to a class implementing vectorized math 
operators (+ - * /) and vectorized comparisons (== != >= <= < >).  From 
HLSL you have to borrow the 'any' and 'all' constructs (along with 
variations for every permutation of the bitmask of the test result) to 
do conditional branching on the results.  That pretty much leaves 
swizzle/shuffle/permute and the outlying features (8-, 16-, and 64-bit 
integers) in the realm of 'ugly'.
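
A sketch of what that class layer might look like (hypothetical names; 
'any' and 'all' fall out of _mm_movemask_ps on the comparison mask):

#include <xmmintrin.h>

// Hypothetical float4 wrapper: vectorized arithmetic and comparisons.
struct Float4
{
    __m128 v;

    Float4() {}
    explicit Float4(__m128 x) : v(x) {}

    Float4 operator+(Float4 rhs) const { return Float4(_mm_add_ps(v, rhs.v)); }
    Float4 operator-(Float4 rhs) const { return Float4(_mm_sub_ps(v, rhs.v)); }
    Float4 operator*(Float4 rhs) const { return Float4(_mm_mul_ps(v, rhs.v)); }
    Float4 operator/(Float4 rhs) const { return Float4(_mm_div_ps(v, rhs.v)); }

    // Comparisons produce a per-lane mask (all ones or all zeros).
    Float4 operator>(Float4 rhs) const { return Float4(_mm_cmpgt_ps(v, rhs.v)); }
    Float4 operator<(Float4 rhs) const { return Float4(_mm_cmplt_ps(v, rhs.v)); }
};

// HLSL-style any/all, built from the sign bits of each lane of the mask.
__forceinline bool Any(Float4 mask) { return _mm_movemask_ps(mask.v) != 0; }
__forceinline bool All(Float4 mask) { return _mm_movemask_ps(mask.v) == 0xF; }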

From here you could build up portable SIMD transcendental functions 
(sin, cos, pow, log, etc.) and other libraries (matrix multiplication, 
inversion, quaternions, etc.).

I would say in D this could be faked provided the language, at a 
minimum, understood what a 128-bit (SSE1 through SSE4.2) and a 256-bit 
value (AVX) are and how to move them efficiently in registers for 
function calls.  Kind of a 'make it at least work in the ABI, come back 
to a good implementation later' solution.  There is some room to beat 
Microsoft here, as the code Visual Studio 2010 currently outputs for 
64-bit targets cannot pass 128-bit SIMD values by register 
(__forceinline functions are the only workaround), even though scalar 
32- and 64-bit float values are passed in XMM registers just fine.

The current hardware landscape dictates organizing your data in 
SIMD-friendly ways.  Naive OOP-based code dereferences too many pointers 
to get at scattered data.  This makes the hardware prefetcher work too 
hard, and it wastes cache by using only a fraction of each 64-byte cache 
line, throwing away 75-90% of the machine's memory bandwidth and 
capacity.
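
A caricature of that naive layout, just to make the pointer-chasing 
concrete (the names here are hypothetical):

// Each particle is a separately allocated object; updating one field
// drags a whole 64-byte cache line in for a few bytes of useful data.
struct Particle
{
    float x, y, z;
    // ... color, lifetime, velocity and other cold data interleaved here ...
};

struct ParticleSystem
{
    Particle** particles;   // array of pointers to scattered heap objects
    int        count;

    void Update(float dx)
    {
        for (int i = 0; i < count; ++i)
            particles[i]->x += dx;   // one useful float per line fetched
    }
};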

>
> D array operations are probably meant to become smarter, when you perform a:
>
> int[8] a, b, c;
> a = b + c;
>

Now, the original topic pertains to data layouts, where SIMD, the CPU 
cache, and efficient code all inter-relate.  I would argue the above 
code is an idealized example, as when writing SIMD code you almost 
always have to transpose or rotate one of the sets of data to work in 
parallel across the other one.  What happens when this code has to 
branch?  In SIMD land you have to test whether any or all 4 lanes of 
SIMD data need to take the branch.  And a lot of the time the best 
course of action is to compute the other code path in addition to the 
first one, AND the first result with the comparison mask, AND-NOT the 
second one with it, and OR the two together to make valid output.  I 
could maybe see a functional language doing OK at this.  The best way to 
convey how common this is in optimized SIMD code is to compare it to 
HLSL's vectorized ternary operator (understanding that 'a' and 'b' can 
be fairly intricate chunks of code if you are clever):

float4 a = {1,2,3,4};
float4 b = {5,6,7,8};
float4 c = {-1,0,1,2};
float4 d = {0,0,0,0};
float4 foo = (c > d) ? a : b;

which results in foo = {5,6,3,4}

For a lot of algorithms the 'a' and 'b' paths have similar cost, so the 
SIMD version runs about 2x faster than the scalar case even though it 
computes both paths, and better-than-2x gains are possible since SIMD 
also naturally reduces or eliminates a ton of branching, which CPUs 
don't like due to their long pipelines.
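
In SSE terms the same select is the mask-and-blend sequence described 
above (a sketch; on SSE4.1 _mm_blendv_ps does it in one instruction):

#include <xmmintrin.h>

// Branchless select: per lane, pick a where the mask is set, else b.
__forceinline __m128 Select(__m128 mask, __m128 a, __m128 b)
{
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

// Mirrors the HLSL example (_mm_set_ps lists lanes high-to-low):
//   __m128 a   = _mm_set_ps(4, 3, 2, 1);
//   __m128 b   = _mm_set_ps(8, 7, 6, 5);
//   __m128 c   = _mm_set_ps(2, 1, 0, -1);
//   __m128 d   = _mm_setzero_ps();
//   __m128 foo = Select(_mm_cmpgt_ps(c, d), a, b);   // {5, 6, 3, 4}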



And as much as Intel likes to argue that a structure containing 
positions for a particle system should look like the following, because 
it makes their hardware benchmarks look awesome, this vertex layout is a 
failure:

// Pure SoA layout: all the X's, then all the Y's, then all the Z's
struct ParticleVertex
{
    float[1000] XPos;
    float[1000] YPos;
    float[1000] ZPos;
}

The GPU (or an audio device) does not consume data this way.  The 
layout is also not cache-friendly if you are trying to read or write a 
single vertex out of the structure.

A hybrid structure which is aware of the size of a SIMD register is the 
next logical choice:

align(16)
struct ParticleVertex
{
    float[4] XPos;
    float[4] YPos;
    float[4] ZPos;
}
ParticleVertex[250] ParticleVertices;

// The struct now fills 48 bytes, 75% of a 64-byte cache line.
// Also, 2 of any 4 random vertex accesses land in the same cache
// line, and at most 2 cache lines are touched in the worst case.

But this hybrid structure still has to be shuffled before being handed 
to a GPU (albeit in much more bite-sized increments that can be 
read-shuffled-written at close to the speed of a platform-optimized 
memcpy).
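
A sketch of that read-shuffle-write step in C++ with SSE (the 
_MM_TRANSPOSE4_PS macro is from xmmintrin.h; the struct and function 
names are hypothetical):

#include <xmmintrin.h>

// C++ mirror of the hybrid block above: 4 vertices in SoA form.
__declspec(align(16)) struct ParticleVertexBlock
{
    float XPos[4];
    float YPos[4];
    float ZPos[4];
};

// Shuffle one block into the interleaved xyzw layout a GPU vertex buffer wants.
inline void ShuffleBlockForGPU(const ParticleVertexBlock& block, float* out /* 16 floats */)
{
    __m128 x = _mm_load_ps(block.XPos);
    __m128 y = _mm_load_ps(block.YPos);
    __m128 z = _mm_load_ps(block.ZPos);
    __m128 w = _mm_set1_ps(1.0f);      // pad positions out to float4

    _MM_TRANSPOSE4_PS(x, y, z, w);     // rows become {x_i, y_i, z_i, 1}

    _mm_storeu_ps(out + 0,  x);
    _mm_storeu_ps(out + 4,  y);
    _mm_storeu_ps(out + 8,  z);
    _mm_storeu_ps(out + 12, w);
}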

Things get really messy when you have multiple vertex attributes, as 
the decision to keep them together or to separate them pulls in 
conflicting directions, and both choices make sense to different 
systems :)


