Ready for review: new std.uni

Mon Jan 14 01:02:26 PST 2013

On 1/14/2013 12:28 AM, David Nadlinger wrote:
> Virtually all common C/C++ compilers have had SIMD types/intrinsics for years
> now, which is more or less exactly what core.simd is for D - well, there is some
> syntax sugar for arithmetic operations on the D vector types, but it doesn't
> cover any of the "interesting" operations such as broadcasting, swizzling, etc.

Actually, they do a crappy job of it in my experiments.

> Yet *all* the big compiler vendors (Microsoft, Intel, GCC, LLVM) seem to find it
> necessary to spend considerable amounts of effort on improving their heuristics
> for the indeed quite problematic optimization process you describe.

The fundamental problem they face is they cannot change the language - they have 
to work with bog standard code that has no notion of SIMD.

> I don't think that rehashing the common reasons cited for the importance of
> (auto-)vectorizing code here would be of any use, you are certainly aware of
> them already. Just don't forget that Manu (who seems to have disappeared from
> the forums, crunch time?) represents a group of programmers who would resort to
> hand-written assembly for their "embarrassingly parallel" code if suitable
> compiler intrinsics didn't exist, whereas the vast majority of programmers
> probably would be troubled to reliably identify the cases where vectorization
> would have the biggest benefits in their code bases, even if they could justify
> spending time/maintainability on manually optimizing their code at this level.

Here's the problem. Draw a table of SIMD vector types on one axis, and 
arithmetic operations on the other. Plot an X for what is supported.

You'll find that the X's don't follow much of a pattern, and the layout 
varies all over the place across different CPUs and architectures.

Now consider your autovectorizer. If an X isn't there, it (silently) generates 
workaround code. The only way you'll really be able to tell it went wrong is to 
disassemble the result - and who is going to do that?

And what's much, much worse is that pulling values out of and into SIMD 
registers to do those workarounds causes massive stalls (the SIMD unit and the 
arithmetic unit are separate), so bad that you're better off doing no 
vectorization at all.

The next problem with an autovectorizer is alignment - consider the C function:

     int func(float *p) { ... }

Now you want to autovectorize inside func(). Is p aligned? Who knows? So you 
must emit code that handles misaligned vectors, which is another performance 
disaster.
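
To make the alignment problem concrete, here is a minimal C sketch (not from any particular compiler; is_aligned and scalar_prologue_len are hypothetical helper names) of the kind of runtime check a vectorizer has to fall back on when all it sees is a bare float*. It assumes 16-byte vectors, 4-byte floats, and a flat address space where casting a pointer to uintptr_t exposes its address bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if p sits on an `alignment`-byte boundary.
   `alignment` must be a nonzero power of two. */
static bool is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: how many leading floats must be handled one at a
   time before p reaches a 16-byte boundary (capped at n, the element count).
   Assumes sizeof(float) == 4 and that p is at least 4-byte aligned. */
static size_t scalar_prologue_len(const float *p, size_t n)
{
    size_t misalign = (size_t)((uintptr_t)p & 15u);
    size_t lead = misalign ? (16 - misalign) / sizeof(float) : 0;
    return lead < n ? lead : n;
}
```

Nothing in the signature of func() tells the compiler which case it is in, so this check (or an unaligned load path) has to be paid for on every call.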

The D approach is that if you are able to get your vector code to compile, it will be:

1. aligned properly
2. compiled to real SIMD instructions
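
As a rough illustration of point 1 only, here is a C11 sketch of what it means for alignment to be carried by the type rather than guessed from a pointer. This is a loose analogue, not D's core.simd: vec4f and vec4f_add are hypothetical names, and unlike the D vector types, nothing here guarantees point 2, since the C compiler may or may not turn the loop into SIMD instructions.

```c
#include <assert.h>   /* static_assert (C11) */

/* Hypothetical type: four floats whose 16-byte alignment is part of the type. */
typedef struct {
    _Alignas(16) float lane[4];
} vec4f;

/* The alignment is checked once, at compile time, not on every call. */
static_assert(_Alignof(vec4f) >= 16, "vec4f must be 16-byte aligned");

/* Because the parameters are vec4f pointers rather than bare float pointers,
   every validly created object passed in is already suitably aligned. */
static void vec4f_add(vec4f *out, const vec4f *a, const vec4f *b)
{
    for (int i = 0; i < 4; ++i)
        out->lane[i] = a->lane[i] + b->lane[i];
}
```

The question "is p aligned?" never comes up inside vec4f_add, because a misaligned vec4f cannot be created without going around the type system.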