Can we fix reverse operator overloading (opSub_r et al.)?

Robert Jacques sandford at jhu.edu
Sun Jul 12 19:33:25 PDT 2009


On Sun, 12 Jul 2009 11:14:02 -0400, Jérôme M. Berger <jeberger at free.fr> wrote:

> Robert Jacques wrote:
>> On Sat, 11 Jul 2009 06:14:56 -0400, Lutger
>> <lutger.blijdestijn at gmail.com> wrote:
>>
>>> "Jérôme M. Berger" wrote:
>>> (...)
>>>>>
>>>>> BLADE has already shown that it is possible to do stuff like this
>>>>> in a library, but I think it goes without saying that if it were
>>>>> built into the language, the syntax could be made considerably
>>>>> nicer. Compare:
>>>>>
>>>>>   auto m = MatrixOp!("a*A*B + b*C")(aVal, bVal, aMtrx, bMtrx, cMtrx);
>>>>>
>>>>>   auto m = a*A*B + b*C;
>>>>>
>>>>> If D could do this, I think it would become the next FORTRAN. :)
>>>>>
>>>>> -Lars
>>>>
>>>> Actually, this has already been done in C++:
>>>> http://flens.sourceforge.net/ It should be possible to port it to D...
>>>>
>>>> Jerome
>>>
>>> Someone correct me if I'm wrong, but I think what Blade does is a
>>> bit more advanced than FLENS. Blade performs optimizations at the
>>> AST level and generates (near-)optimal assembly at compile time. I
>>> couldn't find info on what FLENS does exactly beyond inlining
>>> through expression templates, but from the looks of it, it doesn't
>>> do any of the AST-level optimizations Blade does. Anyone care to
>>> provide more info? Can Blade also generate better asm than is
>>> possible with libraries such as FLENS?
>>
>> FLENS (and several other libraries like it) just provides syntactic
>> sugar for BLAS using expression templates. So something like a = b +
>> c + d gets transformed into (if you're lucky) a = b + c; a = a + d;
>> (and if you're unlucky) temp = b + c; a = temp + d;
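
As a rough sketch, here is what those two decompositions look like in D,
with built-in array operations standing in for the actual BLAS calls
(names and sizes are made up for illustration):

  import std.stdio;

  void main() {
      double[] b = [1.0, 2, 3], c = [10.0, 20, 30], d = [100.0, 200, 300];
      auto a = new double[3];

      // "Lucky" case: the destination is reused, no temporary array.
      a[] = b[] + c[];   // a = b + c
      a[] += d[];        // a = a + d

      // "Unlucky" case: a temporary array is allocated for the first sum.
      auto temp = new double[3];
      temp[] = b[] + c[];
      a[] = temp[] + d[];

      writeln(a);   // [111, 222, 333] either way
  }

Either way, memory is traversed twice; the difference is the extra
allocation.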
>
> 	The way I understand it, FLENS will never take the second solution.
> If it cannot do the first with BLAS, it will fail to compile in
> order to let you choose the best way to decompose your operation.
>
>> So you end up looping through memory multiple times. The next best
>> option is expression templates (Blitz++ or Boost's uBLAS come to
>> mind), which encode the arguments of each operation into a struct.
>> This results in a lot of temporaries and is a major performance hit
>> for small-ish vectors, but you only loop through memory once and make
>> no allocations, which is a win on larger vectors. Then you have BLADE
>> and D array ops, which don't create any temporaries and are faster
>> still. The counter-argument is that a BLAS library can be tuned for
>> each specific CPU or GPU, with the fastest implementation selected at
>> runtime.
>>
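
As an aside, the struct-encoding trick described above can be sketched
in a few lines of D. This is a toy version; real expression-template
libraries add many more node types:

  import std.stdio;

  // An Add node records its operands instead of computing a temporary
  // array; the only "temporaries" are small stack structs.
  struct Add(L, R) {
      L lhs;
      R rhs;
      double opIndex(size_t i) { return lhs[i] + rhs[i]; }
  }

  auto add(L, R)(L lhs, R rhs) { return Add!(L, R)(lhs, rhs); }

  // Evaluating the whole expression tree loops through memory once.
  void assign(E)(double[] dst, E expr) {
      foreach (i; 0 .. dst.length)
          dst[i] = expr[i];
  }

  void main() {
      double[] b = [1.0, 2, 3], c = [10.0, 20, 30], d = [100.0, 200, 300];
      auto a = new double[3];
      assign(a, add(add(b, c), d));   // a = b + c + d in a single pass
      writeln(a);                     // [111, 222, 333]
  }
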
> 	Actually, depending on your architecture, the amount of data and
> the number of operands, looping through memory multiple times may in
> theory be the best option in some cases. This is because it will use
> the cache (and the swap) more efficiently than a solution that loops
> only once through all the arrays at the same time.

For arrays this is never, ever the case. Array ops are memory-bandwidth
bound. (Okay, with enough operands you could mess up the cache, but
hundreds of operands is pathological.) What you're probably thinking of
is the matrix-multiply algorithms, but those algorithms inherently loop
through memory multiple times; it's not something 'added' by the
limitation of a fixed API.
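
To illustrate the contrast (a rough sketch; a naive triple loop stands
in for the real blocked algorithms):

  import std.stdio;

  void main() {
      enum n = 3;
      auto b = new double[n * n], c = new double[n * n], r = new double[n * n];
      b[] = 1.0;
      c[] = 2.0;

      // Array op: every element of b and c is read exactly once, so
      // the loop is limited by memory bandwidth, not by arithmetic.
      r[] = b[] + c[];

      // Matrix multiply: every element of b is read n times (once per
      // output column) and every element of c is read n times (once
      // per output row). The multiple passes over memory are part of
      // the algorithm itself, not an artifact of the API.
      foreach (i; 0 .. n)
          foreach (j; 0 .. n) {
              double sum = 0;
              foreach (k; 0 .. n)
                  sum += b[i * n + k] * c[k * n + j];
              r[i * n + j] = sum;
          }

      writeln(r);   // [6, 6, 6, 6, 6, 6, 6, 6, 6]
  }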


