Can we fix reverse operator overloading (opSub_r et al.)?

Robert Jacques sandford at jhu.edu
Sun Jul 12 07:54:50 PDT 2009


On Sat, 11 Jul 2009 06:14:56 -0400, Lutger <lutger.blijdestijn at gmail.com>  
wrote:

> "Jérôme M. Berger" wrote:
> (...)
>>>
>>> BLADE has already shown that it is possible to do stuff like this in a
>>> library, but I think it goes without saying that if it was built into
>>> the language the syntax could be made considerably nicer. Compare:
>>>
>>>   auto m = MatrixOp!("a*A*B + b*C")(aVal, bVal, aMtrx, bMtrx, cMtrx);
>>>
>>>   auto m = a*A*B + b*C;
>>>
>>> If D could do this, I think it would become the next FORTRAN. :)
>>>
>>> -Lars
>>
>> Actually, this has already been done in C++:
>> http://flens.sourceforge.net/ It should be possible to port it to D...
>>
>> Jerome
>
> Someone correct me if I'm wrong, but I think what Blade does is a bit more
> advanced than FLENS. Blade performs optimizations on the AST level and
> generates (near) optimal assembly at compile time. I couldn't find info on
> what FLENS does exactly beyond inlining through template expressions, but
> from the looks of it it doesn't do any of the AST level optimizations Blade
> does. Anyone care to provide more info? Can Blade also generate better asm
> than is possible with libraries such as FLENS?

FLENS (and several other libraries like it) just provides syntactic sugar
for BLAS, using expression templates to map operators onto BLAS calls. So
something like a = b + c + d gets transformed into (if you're lucky)
a = b + c; a = a + d; and (if you're unlucky) temp = b + c; a = temp + d;.
Either way you end up looping through memory multiple times.

The next best option is full expression templates (Blitz or Boost come to
mind), which encode the arguments of each operation into a struct. This
creates a lot of temporary expression objects and is a major performance
hit for small-ish vectors, but you only loop through memory once and make
no allocations, which is a win on larger vectors.

Then you have BLADE and D array ops, which don't create any temporaries
and are faster still. The counterargument is that a BLAS library can be
tuned for each specific CPU or GPU, with the fastest implementation
selected at runtime.
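
To make the expression-template approach concrete, here is a minimal,
hypothetical sketch in D (my own toy Vec/AddExpr types, written with D2's
opBinary syntax rather than the opAdd/opAdd_r style this thread is about;
it is not code from Blitz, Boost, or FLENS). The + operators build proxy
structs instead of computing anything, and the whole chain is evaluated
element by element in one loop at assignment time:

    import std.stdio;

    struct Vec
    {
        double[] data;

        // "this + rhs" is not computed here; the operands are just
        // captured in a proxy struct.
        auto opBinary(string op : "+", R)(R rhs)
        {
            return AddExpr!(Vec, R)(this, rhs);
        }

        // Assigning an expression runs one fused loop over memory.
        void opAssign(E)(E expr) if (!is(E == Vec))
        {
            foreach (i; 0 .. data.length)
                data[i] = expr[i];
        }

        double opIndex(size_t i) const { return data[i]; }
    }

    struct AddExpr(L, R)
    {
        L lhs;
        R rhs;

        // Evaluate a single element on demand; no temporary array exists.
        double opIndex(size_t i) const { return lhs[i] + rhs[i]; }

        // Allow chaining, so b + c + d nests proxies.
        auto opBinary(string op : "+", T)(T rhs2)
        {
            return AddExpr!(AddExpr!(L, R), T)(this, rhs2);
        }
    }

    void main()
    {
        auto a = Vec(new double[3]);
        auto b = Vec([1.0, 2.0, 3.0]);
        auto c = Vec([4.0, 5.0, 6.0]);
        auto d = Vec([7.0, 8.0, 9.0]);

        a = b + c + d;    // one loop, no intermediate vectors
        writeln(a.data);  // [12, 15, 18]
    }

The cost on small vectors is exactly those proxy structs and the
per-element indirection; the win on large vectors is the single pass and
zero heap allocations.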
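
At the other end, a rough sketch of what the built-in D array ops buy you
(nothing library-specific here, just language features): the whole
right-hand side is fused into a single pass with no temporaries, which is
exactly what the wrapped-BLAS expansion above can't do.

    void main()
    {
        auto a = new double[1_000];
        auto b = new double[1_000];
        auto c = new double[1_000];
        auto d = new double[1_000];
        b[] = 1.0; c[] = 2.0; d[] = 3.0;

        // Wrapped BLAS, at best:  a = b + c;  a = a + d;       (two passes)
        // Wrapped BLAS, at worst: temp = b + c;  a = temp + d; (two passes
        //                                                       plus a temp)
        // D array ops, one fused loop and no temporaries:
        a[] = b[] + c[] + d[];
    }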



