Small Vectors Proposal
Mikola Lysenko
mclysenk at mtu.edu
Sat Feb 3 10:18:55 PST 2007
Knud Soerensen wrote:
> I think it is better to make very general framework for describing the
> vector operations, because when we have a very general description
> then the compiler is free to optimize as it sees fit.
> You can think at it as a SQL for vector operations.
Well, I can't say that I share your enthusiasm for the power of
necromantic compiler optimizations. If you want to see what a
vectorizing compiler looks like, take a gander at:
http://www.codeplay.com/
VectorC is a C compiler which focuses on architecture specific SIMD
optimizations. The problem with their approach is that it is too
general. If you want to create optimized vector code, you need to tweak
things just-so to make the compiler understand it. This means tedious
trial-and-error optimization sessions where you end up rewriting the
same piece of C-code 50 times in order to get the compiler to spit out
the right assembler. To facilitate this process, they provide several
tools and pragmas you can use to give the compiler hints.
By the time you are done, you have a very fast vector optimized SSE
function that only works on one specific compiler. Good luck porting
your results it to GCC! Does it save you time at the end of the day?
Maybe, but what happens when some future releases changes the optimizer
and your code no longer comes out all clean and vectorized? You have to
go back and repeat the process from scratch. For the amount of hair
pulling these compilers cause I would just as soon write the damn code
in assembler and be done with it once and for all.
The basic issue with VectorC and all the other vectorizing compilers is
that they try to do too much. Everyone seems so hell bent on creating
this hyper-general-super-chocolate-coated-vectorizing-optimizer, when in
reality it's not even necessary. Small geometric vectors are far more
common, and improving their utility is a much bigger win for less
effort. We need a convenient and fast set of arithmetic routines for
doing simple geometric stuff, not an overarching super vector-framework.
> ... imagine that the most used operation in the program
> is a plan projection on the x-y plan, then we only need the x and y
> coordinate.
> So an arrangement in memory which looks like
> x1 x2 x3 . . .
> y1 y2 y3 . . .
> z1 z2 z3 . . .
> would result in that the cpu could specific load only the x and the y
> coordinates into the cache, which would speed up the calculations a lot.
>
> So, if one could specify that 3x1000 floats and let the compiler
> arrange them in memory that would be optimal.
This doesn't strike me as very realistic. How is the compiler ever
going to figure out that you wanted to project these vectors onto the
x-y plane somewhere down the road? This is a very difficult global
optimization, and I can think of no way to perform it without direct
programmer intervention.
Moreover, I'm not sure why you split up the x-y components. A much
better layout would look like this:
x y x y x y x y x y ...
z z z ....
Now the alignment for each vector is such that you could load 2 x-y
pairs into a single SSE register at once. This is not only better for
caching, but it is also better for performance.
However, neither the layout you proposed or the one above make much
sense. Typically you want to do more with vectors than just drop the
z-component. You need to perform matrix multiplication, normalization,
dot and cross products etc. Each of these operates most efficiently
when all vector components are packed into a single SIMD register.
Therefore it seems clear that the preferred layout ought to be:
x y z w x y z w ....
I can't really fathom why you would want to go through the extra trouble
of splitting up each component. Not only does it make arithmetic less
efficient, but it also creates a book keeping nightmare. Imagine trying
to grow the total number of vectors in your original layout! You would
need to do multiple costly memcpy operations.
For low dimension vectors, the necessary compiler optimizations are
obvious, and there is one clear 'best' data layout. We don't need any
fancy compiler magic, since the code can be directly translated into
vector operations just like ordinary arithmetic.
More information about the Digitalmars-d
mailing list