Small Vectors Proposal

Knud Soerensen 4tuu4k002 at sneakemail.com
Sat Feb 3 17:29:13 PST 2007


On Sat, 03 Feb 2007 13:18:55 -0500, Mikola Lysenko wrote:

> Knud Soerensen wrote:
>> I think it is better to make a very general framework for describing the
>> vector operations, because when we have a very general description
>> the compiler is free to optimize as it sees fit.
>> You can think of it as an SQL for vector operations.
> 
> Well, I can't say that I share your enthusiasm for the power of 
> necromantic compiler optimizations.  If you want to see what a 
> vectorizing compiler looks like, take a gander at:
> 
> http://www.codeplay.com/
> 
> VectorC is a C compiler which focuses on architecture specific SIMD 
> optimizations.  The problem with their approach is that it is too 
> general.  If you want to create optimized vector code, you need to tweak 
> things just-so to make the compiler understand it.  This means tedious 
> trial-and-error optimization sessions where you end up rewriting the 
> same piece of C-code 50 times in order to get the compiler to spit out 
> the right assembler.  To facilitate this process, they provide several 
> tools and pragmas you can use to give the compiler hints.
> 
> By the time you are done, you have a very fast vector optimized SSE 
> function that only works on one specific compiler.  Good luck porting 
> your results to GCC!  Does it save you time at the end of the day? 
> Maybe, but what happens when some future release changes the optimizer 
> and your code no longer comes out all clean and vectorized?  You have to 
> go back and repeat the process from scratch.  For the amount of hair 
> pulling these compilers cause I would just as soon write the damn code 
> in assembler and be done with it once and for all.
> 
I agree with you on this. 
We need a way to write the vector code once and be able to compile it 
on many architectures with good results.

> The basic issue with VectorC and all the other vectorizing compilers is 
> that they try to do too much.  Everyone seems so hell bent on creating 
> this hyper-general-super-chocolate-coated-vectorizing-optimizer, when in 
> reality it's not even necessary.  Small geometric vectors are far more 
> common, and improving their utility is a much bigger win for less 
> effort.

That might be true for your projects but not for mine.

>  We need a convenient and fast set of arithmetic routines for 
> doing simple geometric stuff, not an overarching super vector-framework.

Actually, what I need is an overarching super TENSOR-framework ;-)
 
>> ... imagine that the most used operation in the program
>> is a projection onto the x-y plane; then we only need the x and y
>> coordinates.
>> So an arrangement in memory which looks like 
>> x1 x2 x3 . . .
>> y1 y2 y3 . . .
>> z1 z2 z3 . . .
>> would mean that the CPU could load only the x and the y
>> coordinates into the cache, which would speed up the calculations a lot.
>> 
>> So, if one could specify 3x1000 floats and let the compiler
>> arrange them in memory, that would be optimal.
> 
> This doesn't strike me as very realistic.  How is the compiler ever 
> going to figure out that you wanted to project these vectors onto the 
> x-y plane somewhere down the road?  This is a very difficult global 
> optimization, and I can think of no way to perform it without direct 
> programmer intervention.
> 

Yes, the optimization might be difficult, but the point is
that the code should not impose artificial limitations on
the compiler. If the compiler knows how to make
the optimization, then it is free to make it.
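As a rough C sketch of the kind of layout I had in mind (the names and
sizes are purely illustrative), a structure-of-arrays form lets a pass
that only needs x and y leave the z values out of the cache entirely:

#include <math.h>
#include <stddef.h>

/* Illustrative structure-of-arrays layout: each component is stored
   contiguously, so a pass over x and y never touches the z array. */
typedef struct {
    float x[1000];
    float y[1000];
    float z[1000];
} Points;

/* Length of each point's projection onto the x-y plane; since z is
   never read, only two thirds of the data moves through the cache. */
void xy_lengths(const Points *p, float *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = sqrtf(p->x[i] * p->x[i] + p->y[i] * p->y[i]);
}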


> Moreover, I'm not sure why you split up the x-y components.  A much 
> better layout would look like this:
> 
> x y x y x y x y x y ...
> z z z ....
> 
> Now the alignment for each vector is such that you could load 2 x-y 
> pairs into a single SSE register at once.  This is not only better for 
> caching, but it is also better for performance.

Good, that's a better example :-)
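To spell it out, two x-y pairs per 128-bit load would look roughly like
this in C with SSE intrinsics (only the <xmmintrin.h> intrinsics are
real; the function and names are just a sketch):

#include <xmmintrin.h>   /* SSE intrinsics */

/* Interleaved layout: x y x y x y ... (z kept elsewhere).  One
   128-bit load picks up two x-y pairs at once. */
void scale_xy_pairs(float *xy, float s, int npairs)
{
    __m128 k = _mm_set1_ps(s);
    int i;
    for (i = 0; i + 1 < npairs; i += 2) {
        __m128 v = _mm_loadu_ps(xy + 2 * i);   /* x0 y0 x1 y1 */
        _mm_storeu_ps(xy + 2 * i, _mm_mul_ps(v, k));
    }
    if (i < npairs) {                          /* odd leftover pair */
        xy[2 * i]     *= s;
        xy[2 * i + 1] *= s;
    }
}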

> However, neither the layout you proposed nor the one above makes much
> sense.

I did not mean to propose a specific layout; what I tried to say
was that the compiler should be free to choose the layout
which fits the problem.

> Typically you want to do more with vectors than just drop the 
> z-component.  You need to perform matrix multiplication, normalization, 
> dot and cross products etc.  Each of these operates most efficiently 
> when all vector components are packed into a single SIMD register. 
> Therefore it seems clear that the preferred layout ought to be:
> 
> x y z w x y z w ....

Then the compiler should use this layout for those typical problems,
but we should not lock the compiler into generating sub-optimal code
for non-typical problems.
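For the typical case, yes, a packed layout where each vector fills one
register is the obvious choice; roughly something like this C sketch
(illustrative names, intrinsics from <xmmintrin.h>):

#include <xmmintrin.h>

/* Packed layout: x y z w x y z w ... -- one vector per SSE register. */
typedef struct {
    float v[4];   /* x, y, z, w */
} Vec4;

/* Component-wise add of two arrays of packed vectors; each vector
   maps onto a single packed add. */
void add_vec4(const Vec4 *a, const Vec4 *b, Vec4 *out, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        __m128 va = _mm_loadu_ps(a[i].v);
        __m128 vb = _mm_loadu_ps(b[i].v);
        _mm_storeu_ps(out[i].v, _mm_add_ps(va, vb));
    }
}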

> I can't really fathom why you would want to go through the extra trouble 
> of splitting up each component.  Not only does it make arithmetic less 
> efficient, but it also creates a book keeping nightmare.  Imagine trying 
> to grow the total number of vectors in your original layout!  You would 
> need to do multiple costly memcpy operations.

Imagine trying to grow the number of dimensions in your layout ;-)
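To be fair, the bookkeeping cost is real in both directions. Growing
the point count in a split layout means one reallocation and copy per
component, roughly like this C sketch (illustrative names, error
handling omitted):

#include <stdlib.h>

/* Split (structure-of-arrays) layout with dynamically sized
   component arrays -- purely illustrative. */
typedef struct {
    float *x, *y, *z;
    size_t count;
} PointsDyn;

/* Growing the number of points means reallocating (and copying)
   each component array separately.  Error handling omitted. */
void grow(PointsDyn *p, size_t newcount)
{
    p->x = realloc(p->x, newcount * sizeof(float));
    p->y = realloc(p->y, newcount * sizeof(float));
    p->z = realloc(p->z, newcount * sizeof(float));
    p->count = newcount;
}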
 

> For low dimension vectors, the necessary compiler optimizations are 
> obvious, and there is one clear 'best' data layout.  We don't need any 
> fancy compiler magic, since the code can be directly translated into 
> vector operations just like ordinary arithmetic.

Yes, good performance for typical problems is important,
but we should be careful not to choose a notation which
ruins the performance for non-typical problems.
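Agreed that for the typical small-vector case the translation is
direct; plain component-wise code like this C sketch corresponds to a
single packed SSE add on the x y z w layout, with no whole-program
analysis needed:

/* A small fixed-size vector written as ordinary arithmetic.  On the
   packed x y z w layout the whole body corresponds to one packed add,
   which is why the typical case needs no fancy analysis. */
typedef struct {
    float x, y, z, w;
} Small4;

Small4 small4_add(Small4 a, Small4 b)
{
    Small4 r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    r.w = a.w + b.w;
    return r;
}

The question is only whether the notation also leaves room for the
layouts that do not fit this pattern.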


