SIMD/intrinsics questions

Michael Farnsworth mike.farnsworth at gmail.com
Mon Nov 9 00:01:30 PST 2009


On 11/08/2009 11:28 PM, Robert Jacques wrote:
> By design, D asm blocks are separated from the optimizer: no code
> motion, etc. occurs. D2 just changed fixed-size arrays to value types,
> which provide most of the functionality of a small vector struct.
> However, actual SSE optimization of these types is probably going to
> wait until x64 support, since a bunch of 32-bit chips don't support them.
>
> P.S. For what it's worth, I do research which involves volumetric
> ray-tracing. I've always found memory to be the bottleneck in
> computations. Also, why not look into CUDA/OpenCL/DirectCompute?
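
A quick note on the fixed-size-array point above: in D2 a plain float[4] 
already behaves like a tiny value-type vector.  A minimal, untested 
sketch (assuming a D2 compiler; the demo() name is just illustrative):

void demo()
{
    // Fixed-size arrays are value types in D2: assignment copies the data.
    float[4] a = [1.0f, 2.0f, 3.0f, 4.0f];
    float[4] b = a;        // full copy, not a reference
    b[0] = 0.0f;           // a[0] stays 1.0f

    // Array-wise operations give element-wise math for free; a
    // vectorizing compiler is free to lower this to SSE.
    float[4] c;
    c[] = a[] + b[];
}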

Yeah, I've discovered that having either the constraints-based __asm() 
from ldc or actual intrinsics would probably open up more optimization 
opportunities.  But if the compiler at least inlined regular asm blocks 
for me, I'd be most of the way there.  The ldc guys tell me they haven't 
exposed the llvm vector intrinsics yet because doing so would need 
either a custom type in the frontend or the D2 
fixed-size-arrays-as-value-types functionality.  I might take a stab at 
some of that in ldc in the future to see if I can get it to work, but 
I'm not an expert in compilers by any stretch of the imagination.
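
For reference, this is the sort of "regular asm block" I mean: a 
hand-rolled SSE add over float[4] using the standard D inline assembler.  
It's an untested 32-bit x86 sketch (the add4 name and register choices 
are just illustrative), and today the whole block is a black box to the 
optimizer, so it never gets inlined or scheduled:

void add4(ref float[4] a, ref float[4] b, ref float[4] r)
{
    float* pa = a.ptr;
    float* pb = b.ptr;
    float* pr = r.ptr;
    asm
    {
        mov EAX, pa;
        mov EDX, pb;
        mov ECX, pr;
        movups XMM0, [EAX];  // load a (unaligned)
        movups XMM1, [EDX];  // load b
        addps  XMM0, XMM1;   // add all four lanes at once
        movups [ECX], XMM0;  // store into r
    }
}

With real intrinsics or the constraints-based __asm(), the compiler 
could register-allocate and schedule around code like this instead of 
treating it as opaque.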

-Mike

PS: As for trying CUDA/OpenCL/DirectCompute, I haven't gotten into it 
much for a few reasons:

* The standards and APIs are still evolving
* I refuse to pigeon-hole myself into Windows (I'm typing this from a 
Fedora 11 box, and at work we're a Linux shop doing movie VFX)
* Larrabee (yes, yes, semi-vaporware until Intel gets their crap 
together) will allow something much closer to standard CPU code.  I 
really think that's the direction the GPU makers are heading in general, 
so why hobble myself with the cruddy GPU memory/threading models I'd 
have to code around right now?
* GPUs keep changing, and every change brings with it subtle (and 
sometimes drastic) effects on your code's performance and results from 
card to card.  It's a nightmare to maintain, and every project we've 
attempted that tried to do production rendering on the GPU (even just 
relighting) has ended in tears and gnashing of teeth.  In the VFX 
industry everyone eventually throws up their hands and goes back to 
optimized CPU rendering (Pixar, ILM, and Tippett have all done that, 
just to name a few).

Good, solid general-purpose CPUs with caches, decently wide SIMD with 
scatter/gather, and plenty of hardware threads are the wave of the 
future.  (Or was that the past?  I can't remember.)

GPUs are slowly converging back to that, except that currently they have 
a programmer-managed cache (texture mem), and they execute multiple 
threads concurrently over the same instructions in groups (warps, in 
CUDA-speak?).  They'll eventually add the 'feature' of a more 
automatically-managed cache, plus better memory throughput from letting 
warps be smaller and more flexible.  And when that happens they'll look 
nearly identical to all the multi-core CPUs again.


