The Matrix to end all Matrix classes (Let's dream!) - BLADE 0.4Alpha

Craig Black cblack at ara.com
Tue Nov 20 08:17:03 PST 2007


"Don Clugston" <dac at nospam.com.au> wrote in message 
news:fhv00l$13eq$1 at digitalmars.com...
> For reference, here's the state of BLADE:
>
> Syntax is:
>
> mixin(vectorize("expression"));
>
> eg,
>
> mixin(vectorize("some_array -= 3.2 * (5.8 * another_array) * 
> some_constant"));
>
> It would be so nice to get rid of the mixin() and the quotes. But they 
> don't impede development in any way. Ideally it would be:
>
> vectorize(some_array -= 3.2 * (5.8 * another_array) * some_constant);
>
> Most significant changes since my last post about it:
> (a) it finally works in a usable form :-). Thanks Walter for some great 
> bugfixes in the past few releases!  I'm using dmd 1.023.
> (b) generates X87, SSE, SSE2, or inline D code, depending on the 
> complexity of the expression and the types involved.
> (c) an expression rewriting step has been added, which performs 
> scalar/const folding, and indexing/slicing folding
>    (eg, (A+7*B)[0..5] --->   A[0..5] + 7*(B[0..5]) ).
> (d) it gives really, really nice error messages and debug output. No 
> template garbage from the library internals -- just a straightforward 
> one-line error message, giving the line in YOUR code which contains the 
> error.
>
> eg,   with an array a[],
>     mixin(vectorize("any+= old*garbage"));
>     mixin(vectorize("a+= 2"));
>
> you get this output (and NOTHING ELSE):
> demo.d(38): static assert  "BLADE: Undefined symbols: any old garbage"
> demo.d(39): static assert  "BLADE: Rank mismatch (addition or subtraction)"
>
> It's still not terribly useful, since it only generates code for packed 
> vectors. But the infrastructure is very solid.
>
> Here's a particularly nasty example which the constant folding can cope 
> with, to generate SSE2 code:
>
>     double [] a = new double[4];
>     double [] d = [0.5, 2.8, 17.0, 28.25, 1, 56.2, 3.4];
>     a[0..$] = [3.4, 565, 31.3, 41.8];
>     double [4][] another = [[33.1, 4543, 43, 878.7], [5.14, 455, 554, 2.43]];
>
>     mixin(vectorize(` a += (d[2..$-1]*2.01*a[2]-another[][1])["abc".length-3..$]`));
>
> -------
> Generates this front-end code (compile with -debug=BladeFrontEnd to see 
> it). Note that there are many asserts to give nice debug info at runtime, 
> but the only runtime code is a single function call, which passes 3 
> pointers and a double into an asm function (there's no inlining work for 
> the compiler to do):
>
> ------
> // bladedemo.d(34)  a += (d[2..$-1]*2.01*a[2]-another[][1])["abc".length-3..$]
> assert(a.length==another[][1][(3u-3)..$].length, `Vector length mismatch`);
> assert(d[2..($-1)][(3u-3)..$].length==another[][1][(3u-3)..$].length, `Vector length mismatch`);
> assert( (cast(size_t)(a.ptr)& 0x0F) == 0, `SSE Vector misalignment: a`);
> assert( (cast(size_t)(d[2..($-1)][(3u-3)..$].ptr)& 0x0F) == 0, `SSE Vector misalignment: d[2..($-1)][(3u-3)..$]`);
> assert( (cast(size_t)(another[][1][(3u-3)..$].ptr)& 0x0F) == 0, `SSE Vector misalignment: another[][1][(3u-3)..$]`);
>
> SSEVECGEN!(2,"A+=((B*C)-D)",double*,double,double*,double*)(another[][1][(3u-3)..$].length,&a[0],((a[2])*2.01),&d[2..($-1)][(3u-3)..$][0],&another[][1][(3u-3)..$][0]);
>
> -----
> The function consists of this ASM code (compile with -debug=BladeBackEnd 
> to see it; BTW all comments are auto-generated). Note there are only 8 asm 
> instructions in the inner loop:
>
> -------
> // Operation : ACB*D-+A=
>
> asm {
>  push EBX;
>   mov EAX, veclength;
>   lea ECX, [8*EAX];     add ECX, values[0];  //A
>   movsd XMM0, values[1];   shufpd XMM0, XMM0,0; //B
>   lea EDX, [8*EAX];     add EDX, values[2];  //C
>   lea EBX, [8*EAX];     add EBX, values[3];  //D
>   xor EAX, EAX;
>   sub EAX, veclength; // counter=-length
>   jz short L2; // test for length==0
>
>   align 16;
> L1:
>   movapd XMM1, [ECX + 8*EAX];  // A
>   movapd XMM2, [EDX + 8*EAX];  // C
>   mulpd XMM2, XMM0; // B*
>   subpd  XMM2, [EBX + 8*EAX];  // D-
>   addpd XMM1, XMM2;  //+
>   movapd [ECX + 8*EAX], XMM1;  // A=
>   add EAX,2;
>   js L1;
> L2:
>   sub EAX, 2;
>   jns L4;
>   movsd XMM1, [ECX + 8*EAX+16];  // A
>   movsd XMM2, [EDX + 8*EAX+16];  // C
>   mulsd XMM2, XMM0; // B*
>   subsd  XMM2, [EBX + 8*EAX+16];  // D-
>   addsd XMM1, XMM2;  //+
>   movsd [ECX + 8*EAX+16], XMM1;  // A=
> L4:
> ;  pop EBX;
> }
> -------

Very, very cool, Don!  This obviously has huge potential.  Do you have any 
plans to support GPUs or multicore CPUs?  Have you looked into SSE3 or SSE4 
to see whether there are any new instructions that would be useful?

Double precision support is the most important to me, but I am also 
interested in single precision and extended precision.  To what degree are 
they supported?  I assume that SSE instructions can't be used for extended 
precision operations.
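For instance (just my own guess at the usage, based on your examples, and 
assuming the relevant BLADE module is imported), something like this would 
presumably have to go through the x87 or inline-D back end, since there is no 
packed SSE form for 80-bit reals:

    real[] x = new real[100];
    x[] = 1.0;
    // 80-bit extended precision: no packed SSE equivalent,
    // so this would presumably need the x87 or inline-D code path.
    mixin(vectorize("x *= 1.5"));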

So there is only support for one-dimensional arrays?  When will 
two-dimensional arrays be supported?  Once they are, a simple test for the 
code generator would be to generate optimized code for single-precision 
multiplication of 3x3 or 4x4 matrices; optimal hand-written code for those 
cases is widely available on the internet for comparison.
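
As a correctness baseline, a naive version might look roughly like this (my 
own sketch using row-major flat float[16] arrays; the name and layout are 
just assumptions, nothing from BLADE):

    // Naive reference: C = A * B for 4x4 single-precision matrices,
    // stored row-major as flat float[16] slices.
    void matmul4(float[] c, float[] a, float[] b)
    {
        assert(a.length == 16 && b.length == 16 && c.length == 16);
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
            {
                float sum = 0.0f;
                for (int k = 0; k < 4; k++)
                    sum += a[i*4 + k] * b[k*4 + j];
                c[i*4 + j] = sum;
            }
    }

The generated code could be checked against that for correctness, and against 
a hand-tuned SSE version for speed.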

-Craig 




