SIMD/intrinsics questions

Tue Nov 10 09:57:31 PST 2009

Walter Bright Wrote:

> Don wrote:
> > The bad news: The DMD back-end is a state-of-the-art backend from the 
> > late 90's. Despite its age, its treatment of integer operations is, in 
> > general, still quite respectable.
> 
> Modern compilers don't do much better. The point of diminishing returns 
> was clearly reached.
> 
> > However, it _never_ generates SSE 
> > instructions. Ever. However, array operations _are_ detected, and they 
> > become calls to library functions which use SSE if available. That's 
> > not bad for moderately large arrays -- 200 elements or so -- but of 
> > course it's completely non-optimal for short arrays.
> > 
> > The good news: Now that static arrays are passed by value, introducing 
> > inline SSE support for short arrays suddenly makes a lot of sense -- 
> > there can be a big performance benefit for a small backend change; it 
> > could be done without introducing SSE anywhere else. Most importantly, 
> > it doesn't require any auto-vectorisation support.
> 
> What the library functions also do is have a runtime switch based on the 
> capabilities of the processor, switching to operations tailored to that 
> processor. Generating the code directly, assuming the existence of SSE, 
> means the code will only run on modern chips. Whether or not this 
> is a problem depends on your application.

For my purposes, runtime detection is probably out the window, unless the tests for it can happen infrequently enough to make the overhead negligible.  There are too many SSE variations to switch on them all, and each one incrementally adds functionality that I could make use of.  I'd rather compile different executables for different hardware and distribute them all (i.e. select the SSE version at compile time).  Really, high-performance graphics is an exercise in getting tightly vectorized code to inline appropriately, eliminating as many loads and stores as possible, and then building algorithms on top of that which don't suck in runtime or memory/cache complexity.
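
To make the compile-time approach concrete, here is a minimal C++ sketch (C++ rather than D, since that is where I get my throughput today). It assumes the compiler defines the usual predefined macros (`__SSE2__` on GCC/Clang, `_M_X64` or `_M_IX86_FP` on MSVC). The names `HAVE_SSE`, `add_floats` and `simd_path` are made up for illustration; this is not how dmd or its runtime library does it.

```cpp
#include <cassert>
#include <cstddef>

// Pick the code path when the executable is built, not when it runs.
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 header; also pulls in the SSE float intrinsics
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// Hypothetical helper: dst[i] = a[i] + b[i] for n floats.
inline void add_floats(const float* a, const float* b, float* dst, std::size_t n) {
    assert(n == 0 || (a && b && dst));
    std::size_t i = 0;
#if HAVE_SSE
    // Four floats per iteration; unaligned loads/stores keep the sketch simple.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail, and the whole loop when SSE was not enabled at build time.
    for (; i < n; ++i) dst[i] = a[i] + b[i];
}

// Reports which path this particular build contains.
inline const char* simd_path() { return HAVE_SSE ? "sse" : "scalar"; }
```

Each executable then contains exactly one path and no per-call capability test. The cost is shipping one build per target instruction set.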

Often in computer graphics you end up distilling a huge number of operations down to SIMD instructions that are heavily threaded and have (hopefully) minimal I/O.  If you introduce any extra overhead for getting to those SIMD instructions, you usually take a measurable throughput hit.  I'd like to see D give me a much better mix of high throughput + high coding productivity.  As it stands, I've got high throughput + medium coding productivity in C++, and medium throughput + fairly high coding productivity in C#.
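
As a rough illustration of the kind of code I mean, here is a small C++ sketch of a four-float value type whose operators are meant to inline, so intermediate results can stay in registers instead of going through a library call per operation. `Vec4`, `load4`, `store4`, `madd` and `VEC4_USE_SSE` are hypothetical names, and whether everything really ends up inlined depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // SSE float intrinsics
#define VEC4_USE_SSE 1
#else
#define VEC4_USE_SSE 0
#endif

// Four floats passed around by value; one SSE register when available.
struct Vec4 {
#if VEC4_USE_SSE
    __m128 r;
#else
    float r[4];
#endif
};

inline Vec4 load4(const float* p) {
    Vec4 v;
#if VEC4_USE_SSE
    v.r = _mm_loadu_ps(p);
#else
    for (int i = 0; i < 4; ++i) v.r[i] = p[i];
#endif
    return v;
}

inline void store4(float* p, Vec4 v) {
#if VEC4_USE_SSE
    _mm_storeu_ps(p, v.r);
#else
    for (int i = 0; i < 4; ++i) p[i] = v.r[i];
#endif
}

inline Vec4 operator+(Vec4 a, Vec4 b) {
    Vec4 v;
#if VEC4_USE_SSE
    v.r = _mm_add_ps(a.r, b.r);
#else
    for (int i = 0; i < 4; ++i) v.r[i] = a.r[i] + b.r[i];
#endif
    return v;
}

inline Vec4 operator*(Vec4 a, Vec4 b) {
    Vec4 v;
#if VEC4_USE_SSE
    v.r = _mm_mul_ps(a.r, b.r);
#else
    for (int i = 0; i < 4; ++i) v.r[i] = a.r[i] * b.r[i];
#endif
    return v;
}

// out[i] = a[i] * b[i] + c[i]; one store per four results, no temporaries in memory.
inline void madd(const float* a, const float* b, const float* c,
                 float* out, std::size_t n) {
    assert(n == 0 || (a && b && c && out));
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        store4(out + i, load4(a + i) * load4(b + i) + load4(c + i));
    for (; i < n; ++i) out[i] = a[i] * b[i] + c[i];  // scalar tail
}
```

The expression in `madd` reads like ordinary arithmetic, which is the productivity half of what I want. The throughput half depends on the compiler collapsing the operators into a handful of SIMD instructions.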

I've started looking at some ldc code to lurch towards this goal, and if there is something I can look at in dmd2 itself to help out, I'd love to.  Just point me where you think I ought to start.

-Mike