BLADE 0.2Alpha: Vector operations with mixins, expression templates, and asm

Wed Apr 4 01:53:53 PDT 2007

Witold Baryluk wrote:
> Hello,
> 
> Very good work. I'm impressed. Actually I was developing a very
> similar program, but yours is practically all I needed.
> I'm developing a linear algebra package in D, and now I'm optimising
> it for vector machines, so your BLAS1-like package will be
> helpful. :)

Do you intend to implement BLAS2 and BLAS3-like functionality? I feel 
that I don't know enough about cache-efficient matrix blocking 
techniques to be confident in presenting code. I don't know how 
sophisticated the techniques are in libraries like ATLAS, but the ones 
used in Blitz++ should be easy to compete with.

Blade is still mostly proof-of-concept rather than being 
industrial-strength -- there are so many possibilities for improvement, 
it's not well tested, and I haven't even put any effort into making the 
code look nice. But as Davidl said, it shows that hard-core back-end 
optimisation can now be done in a library at compile time.

The code is quite modularised, so it would be straightforward to add a 
check for 'is it SSE-able' (i.e., are all the vectors floats) and 'is it 
SSE2-able' (i.e., are all the vectors doubles), and if so, send them off 
into specialised assemblers, otherwise send them to the x87 one. The 
postfix-ing step will involve searching for *+ combinations where fma 
can be used. The ability to know the vector length in many cases can be 
a huge advantage for SSE code, where loop unrolling by a factor of 2 or 
4 is always necessary. I haven't got that far yet.
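
A minimal sketch of the two checks described above, assuming a current D 
compiler. Every name here (Backend, allSame, chooseBackend, countMulAdd) 
is hypothetical and is not taken from Blade's source; the postfix string 
format (single-character operands and operators) is likewise an 
assumption made for illustration.

```d
// Hypothetical sketch, not Blade's actual code.
// Which code generator an expression would be handed to.
enum Backend { x87, sse, sse2 }

// True when every type in Types is exactly T (true for an empty list).
template allSame(T, Types...)
{
    static if (Types.length == 0)
        enum bool allSame = true;
    else
        enum bool allSame = is(Types[0] == T) && allSame!(T, Types[1 .. $]);
}

// Pick a backend from the element types of the vectors in an expression:
// all float -> SSE, all double -> SSE2, anything else -> x87.
template chooseBackend(ElemTypes...)
{
    static if (ElemTypes.length == 0)
        enum Backend chooseBackend = Backend.x87;
    else static if (allSame!(float, ElemTypes))
        enum Backend chooseBackend = Backend.sse;
    else static if (allSame!(double, ElemTypes))
        enum Backend chooseBackend = Backend.sse2;
    else
        enum Backend chooseBackend = Backend.x87;
}

// Count adjacent "*+" pairs in a postfix expression string.
// Assumed format: one character per operand or operator, e.g. "cab*+"
// for c + a*b. This literal scan only sees a multiply that is
// immediately followed by an add; it does not find "ab*c+" (a*b + c).
size_t countMulAdd(string postfix)
{
    size_t n = 0;
    foreach (i; 1 .. postfix.length)
        if (postfix[i - 1] == '*' && postfix[i] == '+')
            ++n;
    return n;
}

// Everything is evaluated at compile time, so these fire during compilation.
static assert(chooseBackend!(float, float, float) == Backend.sse);
static assert(chooseBackend!(double, double) == Backend.sse2);
static assert(chooseBackend!(float, double) == Backend.x87);
static assert(chooseBackend!(real) == Backend.x87);
static assert(countMulAdd("cab*+") == 1);
static assert(countMulAdd("ab+") == 0);
```

Compiling the file with `dmd -c` is enough to exercise the checks, since 
both the type test and the string scan run entirely at compile time. A 
real implementation would presumably also need to consider alignment and 
mixed-precision expressions, which this sketch ignores.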

I'm pretty sure that this type of technique can blow almost everything 
else out of the water. <g>