DeRailed DSL (was Re: compile-time regex redux)

Sun Feb 11 06:30:56 PST 2007

Andrei Alexandrescu (See Website for Email) wrote:
> Bill Baxter wrote:
>> Andrei Alexandrescu (See Website For Email) wrote:
>>> This is a misunderstanding. The syntax is not to be extended. It 
>>> stays fixed, and that is arguably a good thing. The semantics become 
>>> more flexible. For example, they will make it easy to write a matrix 
>>> operation:
>>>
>>> A = (I - B) * C + D
>>>
>>> and generate highly performant code from it. (There are many reasons 
>>> for which that's way harder than it looks.)
>>
>> This is one thing I haven't really understood in the discussion.  How 
>> do the current proposals help that case?  From what I'm getting you're 
>> going to have to write every statement like the above as something like:
>>
>> mixin {
>>    ProcessMatrixExpr!( "A = (I - B) * C + D;" );
>> }
>>
>> How do you get from this mixin/string processing stuff to an API that 
>> I might actually be willing to use for common operations?
> 
> That's a great starting point for a good conversation. And no, you
> wouldn't do it with strings, you'd do it exactly as I wrote, with
> regular code. You could use strings if you want to use more math-looking
> operators and Greek symbols etc.
> 
>>> I think there is a lot of apprehension and misunderstanding 
>>> surrounding what metacode is able and supposed to do or simplify. 
>>> Please, let's focus on understanding _before_ forming an opinion.
>>
>> I think that's exactly what Tom's getting at.  He's asking for 
>> examples of how this would make like better for him and others.  I 
>> think given your background you take it for granted that 
>> metaprogramming is the future.  But D is attracts folks from all kinds 
>> of walks of life, because it promises to be a kinder, gentler C++.  So 
>> some people here aren't even sure why D needs templates at all.  
>> Fortran doesn't have 'em after all.  And Java just barely does.
> 
> Then such people would be hard-pressed to figure why tuples (which are
> essentially typelists, the single most used component in Loki) have
> engendered, according to Walter himself, a surge in language popularity.
> 
> I don't even think metaprogramming is the future. I think that it's one
> of many helpful tools for writing good libraries, as are GC, modules, or
> interfaces. *Everybody* appreciates good libraries.
> 
>> Anyway, I think it would help get everyone on board if some specific 
>> and useful examples were given of how this solves real problems (and 
>> no I don't really consider black and white holes as solving real 
>> problems.  I couldn't even find any non-astronomical uses of the terms 
>> in a google search.)
> 
> The terms are recent and introduced by the Perl community. I happened to
> like the metaphor. The consecrated name is "null object pattern".
> 
>> For instance, it would be nice to see some more concrete discussion about
>> * the "rails" case.
>> * the X = A*B + C matrix/vector expressions case.
>> * the case of generating bindings to scripting langauges / ORB stubs
>> * the Spirit/parser generator case
>>
>> So here's my take on vector expressions since that's the only one I 
>> know anything about.
>>
>> *Problem statement*:
>> Make the expression A=(I-B)*C+D efficient, where the variables are 
>> large vectors (I'll leave out matrices for now).
>>
>> *Why it's hard*:
>> The difficulty is that (ignoring SSE instruction etc) the most 
>> efficient way to compute that is do all operations component-wise.  So 
>> instead of computing I-B then multiplying by C, you compute
>>     A[i] = (I[i]-B[i])*C[i]+D[i];
>> for each i.  This eliminates the need to allocate large intermediate 
>> vectors.
> 
> There's much more than that. (First off, things are more complicated in
> the case of matrices.) You want to be gentle on cache, so when you have
> any column-wise iteration you want to do matrix blocking. Depending on
> the expression, you want to select different block sizes.
> 
> Then you also want to do optionally partial unrolling *on top of
> blocking*. Again, the amount of unrolling might depend on the
> expression. (You don't want to unroll large expressions.) At this point
> things are utterly messy.
> 
> Iterating may also be better row-wise or column-wise, depending on the
> expression. Some subexpressions have a "preferential" row-wise scanning
> order and a "possible" column-wise order. Others must do things one way
> or the other. For all this the appropriate code must be generated.
> 
>> *Existing solutions*:
>>   Expression templates in C++, e.g. The Blitz++ library.  Instead of 
>> making opSub in I-B return a new Vector object, you make opSub return 
>> an ExpressionTemplate object.  This is a little template struct that 
>> contains a reference to I and to B, and knows how to subtract the two 
>> in a component-wise manner.  The types of I and B are template 
>> parameters, LeftT and RightT. Its interface also allows it to be 
>> treated just like a Vector.  You can add a vector to it, subtract a 
>> Vector from it etc.
>>
>> Now we go and try to multiply that result times C.  The result of that 
>> is a new MultExpressionTemplate with two paramters, the LeftT being 
>> our previous SubExpressionTemplate!(Vector,Vector) and the RightT 
>> being Vector.
>>
>> Proceeding on in this way eventually the result of the math is of type:
>>
>> AddExpressionTemplate!(
>>    MultExpressionTemplate!(
>>       SubExpressionTemplate!(Vector,Vector),
>>       Vector),
>>    Vector)
>>
>> And you can see that we basically have a parse tree expressed as 
>> nested templates.  The final trick is that a Vector.opAssign that 
>> takes an ExpressionTemplate is provided and that method calls a method 
>> of the expression template to finally trigger the calculation, like 
>> expr.eval(this).  eval() has the top-level loop over the components of 
>> the answer.
>>
>> *Why Existing solutions are insufficient*
>> For that all that effort to actually be useful the compiler has to be 
>> pretty agressive about inlining everything so that in the end all the 
>> temporary template structs and function calls go away and you're just 
>> left with one eval call.  It can be tricky to get the code into just 
>> the right configuration so that the compiler will do the inlining.  
>> And even then results will depend on the quality of the compiler.  My 
>> attempts using MSVC++ several years ago always came out to be slower 
>> than the naive version, but with about 10x the amount of code.  The 
>> code is also pretty difficult to follow because of all those different 
>> types of expression templates that have to be created.
>>
>> If you include matrices things are even trickier because there are 
>> special cases like "A*B+a*C" that can be computed efficiently by a 
>> single optimized routine.  You'd like to recognize such cases and turn 
>> them into single calls to that fast routine.  You might also want to 
>> recognize "A*B+C" as a special case of that with a==1.
>>
>> *How it could be improved*:
>> ??? this is what I'd like to see explained better.
> 
> Great. One problem is that building ET libraries is exceedingly hard
> because of the multiple issues that must be solved simultaneously. To
> get a glimpse into the kind of problems that we are talking about, see
> e.g. http://www.adtmag.com/joop/carticle.aspx?ID=627. Libraries cannot
> afford to implement an optimal solution, partly because they have so
> much language-y mud to go through, and partly because the C++ language
> simply does not offer the code generation abilities that are needed.
> 
> I haven't sat down to design a linear algebra using D's new abilities,
> but they are definitely helpful in that specifying an elementary matrix
> operation (the core of a loop) only needs one template. For example, if
> you want to specify the kernel of a matrix addition, you can say it as a
> type:
> 
> MatrixOp!("lhs[i, j] + rhs[i, j]", rowwise, columnwise)
> 
> The type specifies the kernel of the operation, conventionally expressed
> in terms of lhs, rhs, i, and j, then the preferred iteration order, and
> finally the alternate order. The MatrixOp template defines a member
> function get(uint i, uint j) that expands the string given in the
> 
> The kernel of multiplication is:
> 
> MatrixOp!("sum!(k, 0, n, lhs[i, k] * rhs[k, j])", rowwise, columnwise,
> blocking!(8), unrolling!(8))
> 
> (sum is not doable in today's D because D lacks the ability to inject a
> new alias name into a template upon invocation.)
> 
> This type specifies how an element of the multiplication is obtained,
> again what preferred and alternate method of iteration is possible, and
> also specifies blocking and unrolling suggestions.
> 
> In a complex expression, lhs and rhs will bind to subexpressions; it's
> easy to then generate the appropriate code upon assignment to the
> target, with heuristically chosen blocking and unrolling parameters
> based on the suggestions made by the subexpressions. All this is doable
> without difficulty through manipulating and generating code during
> compilation.
> 
> 
> Andrei

I've also done some experimentation on this problem.
To do a perfect solution, you'd need to be able to identify when a 
variable is used twice, for example,

a = 3.0 * b + (c - a * 1.5);
The compiler knows that 'a' occurs twice, but that information is not 
transferred into expression templates. The compiler may also know the 
length of some of the arrays, and that is also lost.

I've almost completed a BLAS1 generator for x87 (not SSE), which spits 
out _optimal_ asm code for any combination of vector-scalar operations 
(vec+vec, vec-vec, vec*real, dot(vec, vec), for any mixture of 32-,64-, 
and 80- bit vectors. The cache effects are simple in this case.
The code to do this is amazingly compact, thanks to tuples and the new 
1.005 mixins (in fact, it is a small fraction of the size of an asm 
BLAS1 kernal).
I noticed that the 'expression template' part of the code contained 
almost nothing specific to vectors; I think there's potential for a 
library mixin component to do it.

Of course another approach would be to add
opAssignExpression(char [] expr)()
{
}
where "expr" is the verbatim expression.
a = b*5 + c*d;

would become
mixin a.opAssignExpression!("a=b*5+c*d");

since it seems that with expression templates, you're fighting the 
compiler, attempting to undo everything it has done.