Small Vectors Proposal

Sat Feb 3 10:18:55 PST 2007

Knud Soerensen wrote:
> I think it is better to make very general framework for describing the
> vector operations, because when we have a very general description 
> then the compiler is free to optimize as it sees fit.
> You can think at it as a SQL for vector operations.

Well, I can't say that I share your enthusiasm for the power of 
necromantic compiler optimizations.  If you want to see what a 
vectorizing compiler looks like, take a gander at:

http://www.codeplay.com/

VectorC is a C compiler which focuses on architecture specific SIMD 
optimizations.  The problem with their approach is that it is too 
general.  If you want to create optimized vector code, you need to tweak 
things just-so to make the compiler understand it.  This means tedious 
trial-and-error optimization sessions where you end up rewriting the 
same piece of C-code 50 times in order to get the compiler to spit out 
the right assembler.  To facilitate this process, they provide several 
tools and pragmas you can use to give the compiler hints.

By the time you are done, you have a very fast vector optimized SSE 
function that only works on one specific compiler.  Good luck porting 
your results it to GCC!  Does it save you time at the end of the day? 
Maybe, but what happens when some future releases changes the optimizer 
and your code no longer comes out all clean and vectorized?  You have to 
go back and repeat the process from scratch.  For the amount of hair 
pulling these compilers cause I would just as soon write the damn code 
in assembler and be done with it once and for all.

The basic issue with VectorC and all the other vectorizing compilers is 
that they try to do too much.  Everyone seems so hell bent on creating 
this hyper-general-super-chocolate-coated-vectorizing-optimizer, when in 
reality it's not even necessary.  Small geometric vectors are far more 
common, and improving their utility is a much bigger win for less 
effort.  We need a convenient and fast set of arithmetic routines for 
doing simple geometric stuff, not an overarching super vector-framework.

> ... imagine that the most used operation in the program 
> is a plan projection on the x-y plan, then we only need the x and y
> coordinate.
> So an arrangement in memory which looks like 
> x1 x2 x3 . . .
> y1 y2 y3 . . .
> z1 z2 z3 . . .
> would result in that the cpu could specific load only the x and the y
> coordinates into the cache, which would speed up the calculations a lot.
> 
> So, if one could specify that 3x1000 floats and let the compiler
> arrange them in memory that would be optimal.

This doesn't strike me as very realistic.  How is the compiler ever 
going to figure out that you wanted to project these vectors onto the 
x-y plane somewhere down the road?  This is a very difficult global 
optimization, and I can think of no way to perform it without direct 
programmer intervention.

Moreover, I'm not sure why you split up the x-y components.  A much 
better layout would look like this:

x y x y x y x y x y ...
z z z ....

Now the alignment for each vector is such that you could load 2 x-y 
pairs into a single SSE register at once.  This is not only better for 
caching, but it is also better for performance.

However, neither the layout you proposed or the one above make much 
sense.  Typically you want to do more with vectors than just drop the 
z-component.  You need to perform matrix multiplication, normalization, 
dot and cross products etc.  Each of these operates most efficiently 
when all vector components are packed into a single SIMD register. 
Therefore it seems clear that the preferred layout ought to be:

x y z w x y z w ....

I can't really fathom why you would want to go through the extra trouble 
of splitting up each component.  Not only does it make arithmetic less 
efficient, but it also creates a book keeping nightmare.  Imagine trying 
to grow the total number of vectors in your original layout!  You would 
need to do multiple costly memcpy operations.

For low dimension vectors, the necessary compiler optimizations are 
obvious, and there is one clear 'best' data layout.  We don't need any 
fancy compiler magic, since the code can be directly translated into 
vector operations just like ordinary arithmetic.