OOP, faster data layouts, compilers
Sean Cavanaugh
WorksOnMyMachine at gmail.com
Fri Apr 22 18:12:08 PDT 2011
On 4/22/2011 4:41 PM, bearophile wrote:
> Sean Cavanaugh:
>
>> In many ways the biggest thing I use regularly in game development that
>> I would lose by moving to D would be good built-in SIMD support. The PC
>> compilers from MS and Intel both have intrinsic data types and
>> instructions that cover all the operations from SSE1 up to AVX. The
>> intrinsics are nice in that the job of register allocation and
>> scheduling is given to the compiler and generally the code it outputs is
>> good enough (though it needs to be watched at times).
>
> This is a topic quite different from the one I was talking about, but it's an interesting topic :-)
>
> SIMD intrinsics look ugly, they add lot of noise to the code, and are very specific to one CPU, or instruction set. You can't design a clean language with hundreds of those. Once 256 or 512 bit registers come, you need to add new intrinsics and change your code to use them. This is not so good.
In C++ the intrinsics are easily wrapped by __forceinline global
functions, providing a platform abstraction over the intrinsics.
From there you can write class wrappers to provide the most common
level of functionality, which boils down to a class with vectorized
math operators for + - * / and vectorized comparison functions == !=
>= <= < and >.  From HLSL you have to borrow the 'any' and 'all'
statements (along with variations for every permutation of the bitmask
of the test result) to do conditional branching on the tests.  This
pretty much leaves swizzle/shuffle/permute and the outlying features
(8-, 16-, and 64-bit integers) in the realm of 'ugly'.
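A minimal sketch of that kind of wrapper, using SSE1 intrinsics only
(the class name and free functions here are invented for illustration,
not from any particular engine):

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics

// Thin value wrapper around one 128-bit SIMD register.
struct Vec4 {
    __m128 v;
    Vec4(__m128 x) : v(x) {}
    Vec4(float x) : v(_mm_set1_ps(x)) {}                       // broadcast
    Vec4(float a, float b, float c, float d)
        : v(_mm_setr_ps(a, b, c, d)) {}                        // lane order a,b,c,d
};

// Vectorized math operators, each mapping to one instruction.
inline Vec4 operator+(Vec4 a, Vec4 b) { return _mm_add_ps(a.v, b.v); }
inline Vec4 operator-(Vec4 a, Vec4 b) { return _mm_sub_ps(a.v, b.v); }
inline Vec4 operator*(Vec4 a, Vec4 b) { return _mm_mul_ps(a.v, b.v); }
inline Vec4 operator/(Vec4 a, Vec4 b) { return _mm_div_ps(a.v, b.v); }

// Comparisons yield a per-lane all-ones / all-zeros mask, not a bool.
inline Vec4 operator>(Vec4 a, Vec4 b)  { return _mm_cmpgt_ps(a.v, b.v); }
inline Vec4 operator==(Vec4 a, Vec4 b) { return _mm_cmpeq_ps(a.v, b.v); }

// HLSL-style any/all: movemask packs each lane's sign bit into 4 bits.
inline bool any(Vec4 m) { return _mm_movemask_ps(m.v) != 0; }
inline bool all(Vec4 m) { return _mm_movemask_ps(m.v) == 0xF; }
```

Usage looks like ordinary scalar math until you branch, at which point
`any`/`all` make the lane-mask explicit: `if (any(x > limit)) ...`.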
From here you could build up portable SIMD transcendental functions
(sin, cos, pow, log, etc), and other libraries (matrix multiplication,
inversion, quaternions etc).
I would say in D this could be faked provided the language at a minimum
understood what a 128-bit (SSE1 through 4.2) and a 256-bit (AVX) value
was and how to efficiently move it via registers for function calls.
Kind of a 'make it at least work in the ABI, come back to a good
implementation later' solution.  There is some room to beat Microsoft
here, as the code Visual Studio 2010 currently outputs for 64-bit
environments cannot pass 128-bit SIMD values by register (__forceinline
functions are the only workaround), even though scalar 32- and 64-bit
float values are passed in XMM registers just fine.
The current hardware landscape dictates organizing your data in
SIMD-friendly manners.  Naive OOP-based code is going to dereference
too many pointers to get to scattered data.  This makes the hardware
prefetcher work too hard, and it wastes cache memory by using only a
fraction of each cache line it pulls in, plus wasting 75-90% of the
bandwidth and memory on the machine.
>
> D array operations are probably meant to become smarter, when you perform a:
>
> int[8] a, b, c;
> a = b + c;
>
Now the original topic pertains to data layouts, with which SIMD, the
CPU cache, and efficient code all inter-relate.  I would argue the
above code is an idealistic example, as when writing SIMD code you
almost always have to transpose or rotate one of the sets of data to
work in parallel across the other one.  What happens when this code
has to branch?  In SIMD land you have to test whether any or all 4
lanes of SIMD data need to take it.  And a lot of the time the best
course of action is to compute the other code path in addition to the
first one, AND the first result, NAND the second one, and OR the
results together to make valid output.  I could maybe see a functional
language doing OK at this.  The only reasonable construct to explain
how common this is in optimized SIMD code is to compare it to HLSL's
vectorized ternary operator (with the understanding that 'a' and 'b'
can be fairly intricate chunks of code if you are clever):
float4 a = {1,2,3,4};
float4 b = {5,6,7,8};
float4 c = {-1,0,1,2};
float4 d = {0,0,0,0};
float4 foo = (c > d) ? a : b;
which results in foo = {5,6,3,4}
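The compute-both-paths AND/ANDNOT/OR combination described above can be
sketched with SSE1 intrinsics (a minimal illustration only; SSE4.1's
_mm_blendv_ps collapses the three logic ops into one instruction):

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics

// Branchless per-lane select: the scalar "mask ? a : b" lifted to 4
// lanes.  Both sides are assumed to have been computed already; the
// comparison mask picks the lane from the correct one.
inline __m128 select(__m128 mask, __m128 a, __m128 b)
{
    // (mask & a) | (~mask & b)
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}
```

Replaying the HLSL example: with a = {1,2,3,4}, b = {5,6,7,8},
c = {-1,0,1,2}, d = {0,0,0,0}, `select(_mm_cmpgt_ps(c, d), a, b)`
yields {5,6,3,4}, since the c > d mask is false in the first two lanes
and true in the last two.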
For a lot of algorithms the 'a' and 'b' paths have similar cost, so
the SIMD version executes about 2x faster than the scalar case,
although better-than-2x gains are possible since using SIMD also
naturally reduces or eliminates a ton of branching, which CPUs don't
really like to do due to their long pipelines.
And as much as Intel likes to argue that a structure containing
positions for a particle system should look like the following because
it makes their hardware benchmarks awesome, this vertex layout is a
failure:

struct ParticleVertex
{
    float[1000] XPos;
    float[1000] YPos;
    float[1000] ZPos;
}
The GPU (or audio devices) does not consume it this way.  The data is
also not cache-friendly if you are trying to read or write a single
vertex out of the structure.
A hybrid structure which is aware of the size of a SIMD register is the
next logical choice:

align(16)
struct ParticleVertex
{
    float[4] XPos;
    float[4] YPos;
    float[4] ZPos;
}

ParticleVertex[250] ParticleVertices;
// struct is also now 75% of a 64-byte cache line
// Also, 2 of any 4 random accesses for a vertex are in the same
// cache line, and only 2 lines are touched in the worst case
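The same hybrid (AoSoA) layout can be sketched in C++ for comparison;
the block name and accessor helpers are illustrative only, not from a
real engine:

```cpp
#include <cstddef>

// One block holds 4 vertices: each of X, Y, Z is a contiguous group of
// 4 floats, i.e. exactly one 128-bit SIMD register's worth.
struct alignas(16) ParticleVertexBlock {
    float XPos[4];
    float YPos[4];
    float ZPos[4];
};

static_assert(sizeof(ParticleVertexBlock) == 48,
              "3 groups of 4 floats = 75% of a 64-byte cache line");

// 1000 vertices = 250 blocks of 4
ParticleVertexBlock ParticleVertices[250];

// Scalar access to vertex i touches a single 48-byte block (at most 2
// cache lines), unlike the 1000-element-stride SoA layout above.
inline float& X(std::size_t i) { return ParticleVertices[i / 4].XPos[i % 4]; }
inline float& Y(std::size_t i) { return ParticleVertices[i / 4].YPos[i % 4]; }
inline float& Z(std::size_t i) { return ParticleVertices[i / 4].ZPos[i % 4]; }
```

SIMD loops consume `XPos`/`YPos`/`ZPos` four-at-a-time straight from
each block, while scalar code still gets reasonable locality through
the accessors.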
But this hybrid structure still has to be shuffled before being given
to a GPU (albeit in much more bite-sized increments that could easily
be read-shuffle-written at the same speed as a platform-optimized
memcpy).  Things get really messy when you have multiple vertex
attributes, as decisions to keep them together or separate conflict,
and both choices make sense to different systems :)