The Matrix to end all Matrix classes (Let's dream!) - BLADE 0.4Alpha

Tue Nov 20 08:22:00 PST 2007

"Don Clugston" <dac at nospam.com.au> wrote in message 
news:fhv00l$13eq$1 at digitalmars.com...
> For reference, here's the state of BLADE:
>
> Syntax is:
>
> mixin(vectorize("expression"));
>
> eg,
>
> mixin(vectorize("some_array -= 3.2 * (5.8 * another_array) * some_constant"));
>
> It would be so nice to get rid of the mixin() and the quotes. But they 
> don't impede development in any way. Ideally it would be:
>
> vectorize(some_array -= 3.2 * (5.8 * another_array) * some_constant);
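
For readers unfamiliar with the mixin(...) idiom, here is a minimal sketch of the general mechanism: a function evaluated at compile time returns D source text, and mixin() pastes that text in as ordinary code. The helper name elementwise is hypothetical and this is not how BLADE's vectorize is implemented; BLADE parses the expression itself, whereas this toy version requires the caller to write the [i] indexing by hand.

```d
import std.stdio : writeln;

// Hypothetical helper, NOT BLADE's vectorize(): it can run at compile time
// and returns D source text for an element-wise loop "dst[i] op= expr".
string elementwise(string dst, string op, string expr)
{
    return "foreach (i; 0 .. " ~ dst ~ ".length) { "
         ~ dst ~ "[i] " ~ op ~ " " ~ expr ~ "; }";
}

void main()
{
    double[] some_array = [10.0, 20.0, 30.0];
    double[] another_array = [1.0, 2.0, 3.0];
    double some_constant = 0.5;

    // The string is built during compilation and compiled as if it had
    // been written at this point in the source.
    mixin(elementwise("some_array", "-=",
                      "3.2 * (5.8 * another_array[i]) * some_constant"));

    writeln(some_array);
}
```

The quotes are needed because the library receives the expression as text; that is the price of being able to inspect and rewrite it before any code is generated.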
>
> Most significant changes since my last post about it:
> (a) it finally works in a usable form :-). Thanks Walter for some great 
> bugfixes in the past few releases!  I'm using dmd 1.023.
> (b) generates X87, SSE, SSE2, or inline D code, depending on the 
> complexity of the expression and the types involved.
> (c) an expression rewriting step has been added, which performs 
> scalar/const folding, and indexing/slicing folding
>    (eg, (A+7*B)[0..5] --->   A[0..5] + 7*(B[0..5]) ).
> (d) it gives really, really nice error messages and debug output. No 
> template garbage from the library internals -- just a straightforward 
> one-line error message, giving the line in YOUR code which contains the 
> error.
>
> eg,   with an array a[],
>     mixin(vectorize("any+= old*garbage"));
>     mixin(vectorize("a+= 2"));
>
> you get this output (and NOTHING ELSE):
> demo.d(38): static assert  "BLADE: Undefined symbols: any old garbage"
> demo.d(39): static assert  "BLADE: Rank mismatch (addition or subtraction)"
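
The indexing/slicing folding in (c) can be checked by hand with D's built-in array operations. The sketch below only illustrates why the rewrite is valid: slicing the operands first gives the same values as slicing the finished result, while touching fewer elements. The function names sliceAfter and sliceBefore are made up for this example and are not part of BLADE.

```d
import std.stdio : writeln;

// Unfolded form: evaluate A + 7*B over the full length, then slice the result.
double[] sliceAfter(const double[] A, const double[] B, size_t lo, size_t hi)
{
    auto full = new double[A.length];
    full[] = A[] + B[] * 7.0;
    return full[lo .. hi];
}

// Folded form: slice each operand first, so only hi - lo elements are computed.
double[] sliceBefore(const double[] A, const double[] B, size_t lo, size_t hi)
{
    auto part = new double[hi - lo];
    part[] = A[lo .. hi] + B[lo .. hi] * 7.0;
    return part;
}

void main()
{
    double[] A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    double[] B = [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0];

    // (A + 7*B)[0..5] and A[0..5] + 7*(B[0..5]) hold the same values.
    assert(sliceAfter(A, B, 0, 5) == sliceBefore(A, B, 0, 5));
    writeln(sliceBefore(A, B, 0, 5));
}
```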
>
> It's still not terribly useful, since it only generates code for packed 
> vectors. But the infrastructure is very solid.
>
> Here's a particularly nasty example which the constant folding can cope 
> with, to generate SSE2 code:
>
>     double [] a = new double[4];
>     double [] d = [0.5, 2.8, 17.0, 28.25, 1, 56.2, 3.4];
>     a[0..$] = [3.4, 565, 31.3, 41.8];
>     double [4][] another = [[33.1, 4543, 43, 878.7], [5.14, 455, 554, 2.43]];
>
>     mixin(vectorize(
> ` a += (d[2..$-1]*2.01*a[2]-another[][1])["abc".length-3..$]`));
>
> -------
> Generates this front-end code (compile with -debug=BladeFrontEnd to see 
> it). Note that there are many asserts to give nice debug info at runtime, 
> but the only runtime code is a single function call, which passes 3 
> pointers and a double into an asm function (there's no inlining work for 
> the compiler to do):
>
> ------
> // bladedemo.d(34)  a += (d[2..$-1]*2.01*a[2]-another[][1])["abc".length-3..$]
> assert(a.length==another[][1][(3u-3)..$].length, `Vector length mismatch`);
> assert(d[2..($-1)][(3u-3)..$].length==another[][1][(3u-3)..$].length, `Vector length mismatch`);
> assert( (cast(size_t)(a.ptr)& 0x0F) == 0, `SSE Vector misalignment: a`);
> assert( (cast(size_t)(d[2..($-1)][(3u-3)..$].ptr)& 0x0F) == 0, `SSE Vector misalignment: d[2..($-1)][(3u-3)..$]`);
> assert( (cast(size_t)(another[][1][(3u-3)..$].ptr)& 0x0F) == 0, `SSE Vector misalignment: another[][1][(3u-3)..$]`);
>
> SSEVECGEN!(2,"A+=((B*C)-D)",double*,double,double*,double*)(another[][1][(3u-3)..$].length,&a[0],((a[2])*2.01),&d[2..($-1)][(3u-3)..$][0],&another[][1][(3u-3)..$][0]);
>
> -----
> The function consists of this ASM code (compile with -debug=BladeBackEnd 
> to see it; BTW all comments are auto-generated). Note there are only 8 asm 
> instructions in the inner loop:
>
> -------
> // Operation : ACB*D-+A=
>
> asm {
>   push EBX;
>   mov EAX, veclength;
>   lea ECX, [8*EAX];     add ECX, values[0];  //A
>   movsd XMM0, values[1];   shufpd XMM0, XMM0,0; //B
>   lea EDX, [8*EAX];     add EDX, values[2];  //C
>   lea EBX, [8*EAX];     add EBX, values[3];  //D
>   xor EAX, EAX;
>   sub EAX, veclength; // counter=-length
>   jz short L2; // test for length==0
>
>   align 16;
> L1:
>   movapd XMM1, [ECX + 8*EAX];  // A
>   movapd XMM2, [EDX + 8*EAX];  // C
>   mulpd XMM2, XMM0; // B*
>   subpd  XMM2, [EBX + 8*EAX];  // D-
>   addpd XMM1, XMM2;  //+
>   movapd [ECX + 8*EAX], XMM1;  // A=
>   add EAX,2;
>   js L1;
> L2:
>   sub EAX, 2;
>   jns L4;
>   movsd XMM1, [ECX + 8*EAX+16];  // A
>   movsd XMM2, [EDX + 8*EAX+16];  // C
>   mulsd XMM2, XMM0; // B*
>   subsd  XMM2, [EBX + 8*EAX+16];  // D-
>   addsd XMM1, XMM2;  //+
>   movsd [ECX + 8*EAX+16], XMM1;  // A=
> L4:
> ;  pop EBX;
> }
> -------
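
For comparison, the kernel "A+=((B*C)-D)" that the asm implements can be written as a plain D loop. This is only a reference sketch for reading the listing above: the name kernelReference is invented here, and it makes no attempt to match the generated code's performance. The packed loop at L1 handles two doubles per iteration, and the scalar code after L2 picks up the last element when the length is odd; the loop below simply does one element at a time.

```d
// Hypothetical reference function (not part of BLADE): A += (B*C) - D,
// where A, C and D are vectors and B is a scalar.
void kernelReference(double[] a, double b, const double[] c, const double[] d)
{
    assert(a.length == c.length && a.length == d.length,
           "Vector length mismatch");
    foreach (i; 0 .. a.length)
        a[i] += b * c[i] - d[i];
}

void main()
{
    double[] a = [1.0, 2.0, 3.0];
    double[] c = [4.0, 5.0, 6.0];
    double[] d = [0.5, 0.5, 0.5];

    kernelReference(a, 2.0, c, d);

    // 1 + 2*4 - 0.5, 2 + 2*5 - 0.5, 3 + 2*6 - 0.5
    assert(a == [8.5, 11.5, 14.5]);
}
```

The auto-generated comment "ACB*D-+A=" is the same expression in postfix order, which matches the sequence of instructions in the loop body.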

Another question: how practical would this mixin technique be for generating 
optimal code for ordinary scalar floating-point operations (not vectors or 
matrices)?  The DMD backend currently doesn't optimize floating-point code 
very well, so perhaps this approach could be used for floating-point 
operations in general.