DeRailed DSL (was Re: compile-time regex redux)

Andrei Alexandrescu (See Website For Email) SeeWebsiteForEmail at erdani.org
Sun Feb 11 08:58:16 PST 2007


Don Clugston wrote:
> Andrei Alexandrescu (See Website for Email) wrote:
>> Bill Baxter wrote:
>>> Andrei Alexandrescu (See Website For Email) wrote:
>>>> This is a misunderstanding. The syntax is not to be extended. It 
>>>> stays fixed, and that is arguably a good thing. The semantics become 
>>>> more flexible. For example, they will make it easy to write a matrix 
>>>> operation:
>>>>
>>>> A = (I - B) * C + D
>>>>
>>>> and generate highly performant code from it. (There are many reasons 
>>>> for which that's way harder than it looks.)
>>>
>>> This is one thing I haven't really understood in the discussion.  How 
>>> do the current proposals help that case?  From what I'm getting 
>>> you're going to have to write every statement like the above as 
>>> something like:
>>>
>>> mixin {
>>>    ProcessMatrixExpr!( "A = (I - B) * C + D;" );
>>> }
>>>
>>> How do you get from this mixin/string processing stuff to an API that 
>>> I might actually be willing to use for common operations?
>>
>> That's a great starting point for a good conversation. And no, you
>> wouldn't do it with strings, you'd do it exactly as I wrote, with
>> regular code. You could use strings if you want to use more math-looking
>> operators and Greek symbols etc.
>>
>>>> I think there is a lot of apprehension and misunderstanding 
>>>> surrounding what metacode is able and supposed to do or simplify. 
>>>> Please, let's focus on understanding _before_ forming an opinion.
>>>
>>> I think that's exactly what Tom's getting at.  He's asking for 
>>> examples of how this would make life better for him and others.  I 
>>> think given your background you take it for granted that 
>>> metaprogramming is the future.  But D attracts folks from all 
>>> walks of life, because it promises to be a kinder, gentler 
>>> C++.  So some people here aren't even sure why D needs templates at 
>>> all.  Fortran doesn't have 'em after all.  And Java just barely does.
>>
>> Then such people would be hard-pressed to figure why tuples (which are
>> essentially typelists, the single most used component in Loki) have
>> engendered, according to Walter himself, a surge in language popularity.
>>
>> I don't even think metaprogramming is the future. I think that it's one
>> of many helpful tools for writing good libraries, as are GC, modules, or
>> interfaces. *Everybody* appreciates good libraries.
>>
>>> Anyway, I think it would help get everyone on board if some specific 
>>> and useful examples were given of how this solves real problems (and 
>>> no I don't really consider black and white holes as solving real 
>>> problems.  I couldn't even find any non-astronomical uses of the 
>>> terms in a google search.)
>>
>> The terms are recent and introduced by the Perl community. I happened to
>> like the metaphor. The established name is the "null object pattern".
>>
>>> For instance, it would be nice to see some more concrete discussion 
>>> about
>>> * the "rails" case.
>>> * the X = A*B + C matrix/vector expressions case.
>>> * the case of generating bindings to scripting languages / ORB stubs
>>> * the Spirit/parser generator case
>>>
>>> So here's my take on vector expressions since that's the only one I 
>>> know anything about.
>>>
>>> *Problem statement*:
>>> Make the expression A=(I-B)*C+D efficient, where the variables are 
>>> large vectors (I'll leave out matrices for now).
>>>
>>> *Why it's hard*:
>>> The difficulty is that (ignoring SSE instruction etc) the most 
>>> efficient way to compute that is do all operations component-wise.  
>>> So instead of computing I-B then multiplying by C, you compute
>>>     A[i] = (I[i]-B[i])*C[i]+D[i];
>>> for each i.  This eliminates the need to allocate large intermediate 
>>> vectors.
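
Concretely, the fused version is a single pass with no temporaries (a sketch in C++, with std::vector standing in for the Vector type):

```cpp
#include <cstddef>
#include <vector>

// Naive evaluation allocates a temporary for (I-B), another for the
// product, and another for the sum.  The fused loop below touches each
// element once and allocates only the result.
std::vector<double> fused(const std::vector<double>& I,
                          const std::vector<double>& B,
                          const std::vector<double>& C,
                          const std::vector<double>& D) {
    std::vector<double> A(I.size());
    for (std::size_t i = 0; i < I.size(); ++i)
        A[i] = (I[i] - B[i]) * C[i] + D[i];
    return A;
}
```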
>>
>> There's much more than that. (First off, things are more complicated in
>> the case of matrices.) You want to be gentle on cache, so when you have
>> any column-wise iteration you want to do matrix blocking. Depending on
>> the expression, you want to select different block sizes.
>>
>> Then you also want to do optionally partial unrolling *on top of
>> blocking*. Again, the amount of unrolling might depend on the
>> expression. (You don't want to unroll large expressions.) At this point
>> things are utterly messy.
>>
>> Iterating may also be better row-wise or column-wise, depending on the
>> expression. Some subexpressions have a "preferential" row-wise scanning
>> order and a "possible" column-wise order. Others must do things one way
>> or the other. For all this the appropriate code must be generated.
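
Blocking itself is mechanical; the hard part is choosing the parameters per expression. A sketch of a blocked element-wise kernel (the tile size of 8 is fixed here purely for illustration):

```cpp
#include <cstddef>

// Blocked element-wise add of two n x n row-major matrices.  Walking
// the matrices in BLK x BLK tiles keeps the working set in cache even
// when a subexpression forces column-wise access; a code generator
// would pick BLK (and the unroll factor) per expression.
constexpr std::size_t BLK = 8;

void blocked_add(const double* lhs, const double* rhs,
                 double* out, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += BLK)
        for (std::size_t jj = 0; jj < n; jj += BLK)
            for (std::size_t i = ii; i < ii + BLK && i < n; ++i)
                for (std::size_t j = jj; j < jj + BLK && j < n; ++j)
                    out[i * n + j] = lhs[i * n + j] + rhs[i * n + j];
}
```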
>>
>>> *Existing solutions*:
>>>   Expression templates in C++, e.g. The Blitz++ library.  Instead of 
>>> making opSub in I-B return a new Vector object, you make opSub return 
>>> an ExpressionTemplate object.  This is a little template struct that 
>>> contains a reference to I and to B, and knows how to subtract the two 
>>> in a component-wise manner.  The types of I and B are template 
>>> parameters, LeftT and RightT. Its interface also allows it to be 
>>> treated just like a Vector.  You can add a vector to it, subtract a 
>>> Vector from it etc.
>>>
>>> Now we go and try to multiply that result times C.  The result of 
>>> that is a new MultExpressionTemplate with two parameters, the LeftT 
>>> being our previous SubExpressionTemplate!(Vector,Vector) and the 
>>> RightT being Vector.
>>>
>>> Proceeding on in this way eventually the result of the math is of type:
>>>
>>> AddExpressionTemplate!(
>>>    MultExpressionTemplate!(
>>>       SubExpressionTemplate!(Vector,Vector),
>>>       Vector),
>>>    Vector)
>>>
>>> And you can see that we basically have a parse tree expressed as 
>>> nested templates.  The final trick is that a Vector.opAssign that 
>>> takes an ExpressionTemplate is provided and that method calls a 
>>> method of the expression template to finally trigger the calculation, 
>>> like expr.eval(this).  eval() has the top-level loop over the 
>>> components of the answer.
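
A minimal self-contained sketch of this machinery (C++; the names mirror the description above, but real libraries such as Blitz++ are far more elaborate):

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

struct Vec;
template <class L, class R> struct SubExpr;
template <class L, class R> struct MultExpr;
template <class L, class R> struct AddExpr;

// Trait restricting the overloaded operators to our own node types, so
// they cannot hijack arithmetic on doubles or other libraries' types.
template <class T> struct is_expr : std::false_type {};
template <> struct is_expr<Vec> : std::true_type {};
template <class L, class R> struct is_expr<SubExpr<L, R>> : std::true_type {};
template <class L, class R> struct is_expr<MultExpr<L, R>> : std::true_type {};
template <class L, class R> struct is_expr<AddExpr<L, R>> : std::true_type {};

// Each node of the "parse tree" computes element i on demand; no
// intermediate vectors are ever allocated.
template <class L, class R> struct SubExpr {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] - r[i]; }
};
template <class L, class R> struct MultExpr {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] * r[i]; }
};
template <class L, class R> struct AddExpr {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::vector<double> v;
    double operator[](std::size_t i) const { return v[i]; }
    // The opAssign analog: one loop drives the whole fused evaluation.
    template <class E> Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < v.size(); ++i) v[i] = e[i];
        return *this;
    }
};

template <class L, class R,
          class = typename std::enable_if<is_expr<L>::value && is_expr<R>::value>::type>
SubExpr<L, R> operator-(const L& l, const R& r) { return {l, r}; }
template <class L, class R,
          class = typename std::enable_if<is_expr<L>::value && is_expr<R>::value>::type>
MultExpr<L, R> operator*(const L& l, const R& r) { return {l, r}; }
template <class L, class R,
          class = typename std::enable_if<is_expr<L>::value && is_expr<R>::value>::type>
AddExpr<L, R> operator+(const L& l, const R& r) { return {l, r}; }
```

With this, writing `A = (I - B) * C + D;` builds exactly the nested AddExpr/MultExpr/SubExpr type shown above and evaluates it in a single loop inside operator=.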
>>>
>>> *Why Existing solutions are insufficient*
>>> For that all that effort to actually be useful the compiler has to be 
>>> pretty aggressive about inlining everything so that in the end all the 
>>> temporary template structs and function calls go away and you're just 
>>> left with one eval call.  It can be tricky to get the code into just 
>>> the right configuration so that the compiler will do the inlining.  
>>> And even then results will depend on the quality of the compiler.  My 
>>> attempts using MSVC++ several years ago always came out to be slower 
>>> than the naive version, but with about 10x the amount of code.  The 
>>> code is also pretty difficult to follow because of all those 
>>> different types of expression templates that have to be created.
>>>
>>> If you include matrices things are even trickier because there are 
>>> special cases like "A*B+a*C" that can be computed efficiently by a 
>>> single optimized routine.  You'd like to recognize such cases and 
>>> turn them into single calls to that fast routine.  You might also 
>>> want to recognize "A*B+C" as a special case of that with a==1.
>>>
>>> *How it could be improved*:
>>> ??? this is what I'd like to see explained better.
>>
>> Great. One problem is that building ET libraries is exceedingly hard
>> because of the multiple issues that must be solved simultaneously. To
>> get a glimpse into the kind of problems that we are talking about, see
>> e.g. http://www.adtmag.com/joop/carticle.aspx?ID=627. Libraries cannot
>> afford to implement an optimal solution, partly because they have so
>> much language-y mud to go through, and partly because the C++ language
>> simply does not offer the code generation abilities that are needed.
>>
>> I haven't sat down to design a linear algebra using D's new abilities,
>> but they are definitely helpful in that specifying an elementary matrix
>> operation (the core of a loop) only needs one template. For example, if
>> you want to specify the kernel of a matrix addition, you can say it as a
>> type:
>>
>> MatrixOp!("lhs[i, j] + rhs[i, j]", rowwise, columnwise)
>>
>> The type specifies the kernel of the operation, conventionally expressed
>> in terms of lhs, rhs, i, and j, then the preferred iteration order, and
>> finally the alternate order. The MatrixOp template defines a member
>> function get(uint i, uint j) that expands the string given as the
>> first template argument.
>>
>> The kernel of multiplication is:
>>
>> MatrixOp!("sum!(k, 0, n, lhs[i, k] * rhs[k, j])", rowwise, columnwise,
>> blocking!(8), unrolling!(8))
>>
>> (sum is not doable in today's D because D lacks the ability to inject a
>> new alias name into a template upon invocation.)
>>
>> This type specifies how an element of the multiplication is obtained,
>> again what preferred and alternate method of iteration is possible, and
>> also specifies blocking and unrolling suggestions.
>>
>> In a complex expression, lhs and rhs will bind to subexpressions; it's
>> easy to then generate the appropriate code upon assignment to the
>> target, with heuristically chosen blocking and unrolling parameters
>> based on the suggestions made by the subexpressions. All this is doable
>> without difficulty through manipulating and generating code during
>> compilation.
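
The proposal above is D-specific (string mixins), but its shape can be approximated in C++ with the kernel as a callable and the iteration orders as template parameters. Everything below is an invented analog for illustration, not the actual design:

```cpp
#include <cstddef>

enum class Order { rowwise, columnwise };

// The kernel is a compile-time entity (here a callable type), and the
// preferred/alternate iteration orders ride along as template
// parameters, so whatever loop is generated on assignment can inline
// the kernel and choose between the orders.
template <class Kernel, Order Preferred, Order Alternate>
struct MatrixOp {
    Kernel kernel;
    double get(std::size_t i, std::size_t j) const { return kernel(i, j); }
};

template <Order P, Order A, class Kernel>
MatrixOp<Kernel, P, A> make_op(Kernel k) { return {k}; }
```

An element-wise addition kernel then reads `make_op<Order::rowwise, Order::columnwise>([&](std::size_t i, std::size_t j) { return lhs[i*n + j] + rhs[i*n + j]; })`, the rough counterpart of the MatrixOp! string above.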
>>
>>
>> Andrei
> 
> I've also done some experimentation on this problem.
> To do a perfect solution, you'd need to be able to identify when a 
> variable is used twice, for example,
> 
> a = 3.0 * b + (c - a * 1.5);
> The compiler knows that 'a' occurs twice, but that information is not 
> transferred into expression templates.

I think what you really need is aliasing information (e.g. two names 
referring to the same vector), and that is typically not easily computable.

> The compiler may also know the 
> length of some of the arrays, and that is also lost.

Ah, indeed. So that would mean better error checking and probably less 
runtime bounds checking.

> I've almost completed a BLAS1 generator for x87 (not SSE), which spits 
> out _optimal_ asm code for any combination of vector-scalar operations 
> (vec+vec, vec-vec, vec*real, dot(vec, vec)), for any mixture of 32-, 
> 64-, and 80-bit vectors. The cache effects are simple in this case.
> The code to do this is amazingly compact, thanks to tuples and the new 
> 1.005 mixins (in fact, it is a small fraction of the size of an asm 
> BLAS1 kernel).
> I noticed that the 'expression template' part of the code contained 
> almost nothing specific to vectors; I think there's potential for a 
> library mixin component to do it.
> 
> Of course another approach would be to add
> opAssignExpression(char [] expr)()
> {
> }
> where "expr" is the verbatim expression.
> a = b*5 + c*d;
> 
> would become
> mixin a.opAssignExpression!("a=b*5+c*d");
> 
> since it seems that with expression templates, you're fighting the 
> compiler, attempting to undo everything it has done.

Not everything - e.g., precedence of operators remains unchanged. In the 
string-based approach you'd have to write a little parser to reimplement 
operator precedences. But point taken.
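
That "little parser" is modest but real work; a precedence-climbing sketch (C++) that makes the precedences explicit by fully parenthesizing:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Precedence-climbing parser for +, -, * over alphanumeric atoms.
// Output is the fully parenthesized form -- i.e. exactly the structure
// that operator overloading would have given for free.
struct Parser {
    const std::string& src;
    std::size_t pos = 0;

    char peek() const { return pos < src.size() ? src[pos] : '\0'; }

    static int prec(char op) {
        return op == '*' ? 2 : (op == '+' || op == '-') ? 1 : 0;
    }

    std::string primary() {
        std::string atom;
        while (std::isalnum(static_cast<unsigned char>(peek()))) atom += src[pos++];
        return atom;
    }

    std::string expr(int min_prec = 1) {
        std::string lhs = primary();
        while (prec(peek()) >= min_prec) {
            char op = src[pos++];
            std::string rhs = expr(prec(op) + 1);  // +1 gives left associativity
            lhs = "(" + lhs + op + rhs + ")";
        }
        return lhs;
    }
};
```

Parsing "b*5+c*d" this way yields "((b*5)+(c*d))", recovering the tree the compiler had already built before the expression was flattened into a string.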


Andrei


