BLADE 0.2Alpha: Vector operations with mixins, expression templates, and asm

Mon Apr 9 11:37:34 PDT 2007

Don Clugston wrote:
> Pragma wrote:
>> Bruno Medeiros wrote:
>>> Pragma wrote:
>>>> Don Clugston wrote:
>>>>> I have been trying to come up with a convincing use case for the 
>>>>> new mixins (and for metaprogramming in general). My best effort to 
>>>>> date can be found at:
>>>>> http://www.dsource.org/projects/mathextra/browser/trunk/mathextra/Blade.d 
>>>>>
>>>>>
>>>>> It generates near-optimal x87 asm code for BLAS1-style basic vector 
>>>>> operations. 32, 64 and 80 bit vectors are all supported.
>>>>>
>>>>> Compile with -version=BladeDebug to see the asm code which is 
>>>>> generated.
>>>>>
>>>>> Typical usage:
>>>>>
>>>>> void main()
>>>>> {
>>>>>     auto p = Vec([1.0L, 2, 18]);    // a vector of 80-bit reals.
>>>>>     auto q = Vec([3.5L, 1.1, 3.8]);  // ditto
>>>>>     auto r = Vec([17.0f, 28.25, 1]); // a vector of 32-bit floats
>>>>>     auto z = Vec([17.0i, 28.1i, 1i]); // a vector of 64-bit idoubles
>>>>>     real d = dot(r, p+r+r);
>>>>>     ireal e = dot(r, z);
>>>>>     q -= ((r+p)*18.0L*314.1L - (p-r))* 35;
>>>>>     d = dot(r, p+r+r);
>>>>> }
>>>>>
>>>>> Notice that mixed-length operations (real[] + float[] - double[]) 
>>>>> are supported.
>>>>>
>>>>> Like the C++ Blitz++ library, expression templates are used to 
>>>>> convert vector expressions into efficient element-wise operations. 
>>>>> Unlike that library, however, there is no reliance on the 
>>>>> compiler's optimiser. Instead, the expression template is 
>>>>> manipulated as text, converted into postfix, and then passed to a 
>>>>> simple CTFE compile-time assembler, which creates highly efficient 
>>>>> asm code which is used as a mixin.
>>>>> To understand the later parts of the code, you need some knowledge 
>>>>> of x87 assembler. In fact, you probably need to have read Agner 
>>>>> Fog's superb Pentium optimisation manual (www.agner.org).
>>>>>
>>>>> Some observations:
>>>>> * I was amazed at how simple the expression template code is (it is 
>>>>> somewhat cluttered by the code to check for real/imaginary type 
>>>>> mismatch errors).
>>>>> * I've often read that the x87 floating-point stack is notoriously 
>>>>> difficult for compilers to write code for, but it works quite well 
>>>>> in this case.
>>>>> * The major workarounds are:
>>>>> - inability to use a tuple element directly from asm code (bug #1028);
>>>>> - inability to define operators for built-in arrays (hence the use 
>>>>> of 'Vec' wrappers).
>>>>> - inability to index through a tuple in a CTFE function (solved by 
>>>>> converting types into a string).
>>>>> * There have been mutterings about how unhygenic/dangerous the new 
>>>>> mixins are. In this case, the mixin forms the _entire_ body of the 
>>>>> function. This is an interesting situation which I think a language 
>>>>> purist will find more palatable.
>>>>>
>>>>> Enjoy.
>>>>
>>>> This is a work of art Don - it's practically a compiler extension.  
>>>> Nice job. :)
>>>>
>>>> For others that are interested in how this actually gets the job 
>>>> done, I found this in your documentation:
>>>>
>>>> * THEORY:
>>>> * Expression templates are used to create an expression string of 
>>>> the form "(a+b*c)+d"
>>>> * and a tuple, the entries of which correspond to a, b, c, d, ...
>>>> * This string is converted to postfix. The postfix string is 
>>>> converted to
>>>> * a string containing x87 asm, which is then mixed into a function 
>>>> which accepts the tuple.
>>>>
>>>
>>> Hum, a minor question, is a string representation of the expressions 
>>> better (easier to use, manipulate, etc.) than a tree representation?
>>>
>>
>> I know Don wrote this lib, but I think I can answer this.
>>
>> In short: no.  But it's really an unintentionally loaded question: 
>> there are some rather severe limitations as to what kind of data 
>> structures you can create at compile time.  You're basically limited 
>> to tuples and strings, each of which have drawbacks of their own.
>>
>> You can create a tree using tuples, but then you run into problems 
>> with over-running the limitations on identifier length built into the 
>> compiler. 
> Actually I don't know how to make a tree nicely with tuples.

Here's an example (for grins):

import std.stdio;

template Node(char[] Data,Nodes...){
	alias Data data;
	alias Nodes nodes;
}

template PrintData(char[] parent){
	const char[] PrintData = "";
}

template PrintData(char[] parent,alias Node){
	static if(parent == ""){
		const char[] PrintData = Node.data ~ "\n" ~
			PrintData!(Node.data,Node.nodes);
	}
	else{
		const char[] PrintData = parent ~ "." ~ Node.data ~ "\n" ~
			PrintData!(parent ~ "." ~ Node.data,Node.nodes);
	}
}

template PrintData(char[] parent,alias Node,V...){
	const char[] PrintData = PrintData!(parent,Node) ~ PrintData!(parent,V);
}

// example "tree" structure
alias Node!("1",
	Node!("1"),
	Node!("2",
		Node!("1"),
		Node!("2")
	)
) dataset;

void main(){
	writefln("%s",PrintData!("",dataset));
}

> 
>  Strings are no where near as flexible out of the box, but
>> pose no storage limitations, which gives a slight edge to string-based 
>> representations of data.  With some clever coding, strings can be made 
>> to store just about any structure imaginable (even if it does make for 
>> some ugly code).
> 
> There's a couple of minor things, though: (1) in the case of generating 
> D code to mixin, no tree is required (the string just gets mixed back in 
> again, so it might as well stay as a string).
> (2) The "separate string and tuple" structure means the expression can 
> be optimised without needing to move any of the values around; this 
> creates less garbage for the compiler to optimise away.
> (3) Some patterns are probably easier to find with a string, than a tree.
> (4) It's more natural for D coders to think in terms of arrays, than 
> lists. (and there is much more language support).
> 
> (5) A string representation can easily express things like c=a+3*(a+b);
> where 'a' appears twice. (At present, there's no way to generate it, but 
> theoretically the compiler could provide it many cases).
> 
> So -- if trees were available, I'm not sure if I'd use them, or not.

They're really handy.  I've found some uses for them already:

/*
ZeroOrMoreExpr
	= generate.ZeroOrMoreExpr()
	::= "["  "{"  Expression:expr  "}"  "]"  [RangeSetExpr:ranges ] [Binding:binding ]
		["*" SubExpressionTerm:delim ] [SubExpressionTerm:term];
*/
bool ZeroOrMoreExpr(inout char[] value,inout char[] bindings,inout char[] tokens,inout char[] err){
	return Rule!(
		generate.ZeroOrMoreExpr,
		And!(
			Literal!("["),
			Literal!("{"),
			Bind!(Expression,"expr"),
			Literal!("}"),
			Literal!("]"),
			Optional!(
				Bind!(RangeSetExpr,"ranges")
			),			
			Optional!(
				Bind!(Binding,"binding")
			),
			Optional!(
				And!(
					Literal!("*"),
					Bind!(SubExpressionTerm,"delim")
				)
			),
			Bind!(SubExpressionTerm,"term")
		)
	)(value,bindings,tokens,err);		
}

There's a *lot* going on in this sample.  Basically what you see here is a chunk of the as-of-yet-experimental 
compile-time Enki parser.  This piece parses the Zero-or-more expression part of the EBNF variant that Enki supports.

The templates evaluate to CTFE's that in turn make up the parser when it's executed.  So there's several layers of 
compile-time evaluation going on here.  Also, what's not obvious is that "char[] tokens" is actually an *array of 
strings* that is stored in a single char[] array; each string's length data is encoded as a size_t mapped onto the 
appropriate number of chars.  The Bind!() expressions also map parsed out data to a key/value set which is stored in a 
similar fashion (char[] bindings).

The goals here were to side-step the limitations of D's identifier name-length while providing a user-readable toolkit 
for parser composition; the only limit now is placed on the length of any given rule.  I haven't found a way to unify 
the compile-time and run-time parser generation yet (one backend with two targets), but I anticipate using something 
very similar to the above by providing a CT and RT version of the template library.  At a minimum I plan on having this 
consume a .bnf file at compile-time, in order to create a run-time parser via mixin.

-- 
- EricAnderton at yahoo