How about implementing SPMD on SIMD for D?

Fri Jul 6 23:08:27 UTC 2018

TL;DR
Want to run your code 8x - 32x faster? SPMD (Single Program 
Multiple Data) on SIMD (Single Instruction Multiple Data) might 
be the answer you're looking for.
It works by running multiple iterations/instances of your loop at 
once on the SIMD lanes, and the compiler could do that 
automatically for your normal loop & array code.

---

I'm a bit late to the party, but I was recently reading this ( 
http://pharr.org/matt/blog/2018/04/30/ispc-all.html ), a highly 
interesting blog post series about how one guy did what the Intel 
compiler team wouldn't or couldn't do.
He wrote a C-like language and a compiler on top of LLVM which 
transforms normal scalar code into "parallel" SIMD code.
That compiler is called ISPC ( 
https://ispc.github.io/perf.html ).

It basically works the same way as GPU shaders, but the code 
runs on the CPU's SIMD units.
You write your code for one thread/lane and the compiler then 
runs N instances of that code simultaneously in lockstep.
For example, with 4 lanes, a loop running 8x (c.xyzw = a.xyzw + 
b.xyzw) would become a loop running 2x (x.cccc = x.aaaa + x.bbbb; 
y.cccc = y.aaaa + y.bbbb; z.cccc = z.aaaa + z.bbbb; w.cccc = 
w.aaaa + w.bbbb), where x.aaaa means the x components of four 
consecutive a's (the notation here is a bit weird, but I was 
trying to keep it short).
Branches are done using masking, so the code runs both sides of 
the branch, but masks away the wrong results.
All of this is way better described in the paper they wrote about 
it ( http://pharr.org/matt/papers/ispc_inpar_2012.pdf ). I 
recommend reading it.
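
To make the lockstep and masking ideas above a bit more concrete, 
here is a rough sketch in plain D that emulates 4 lanes with 
static arrays. This is not how ISPC is implemented; the names 
Lanes, Wide, Vec4SoA, addLanes and branchMasked are made up for 
illustration, and the array ops only stand in for real SIMD 
instructions.

```d
// Sketch only: emulates 4 SPMD lanes with static arrays.
// All names here are hypothetical, not part of ISPC or D.
enum Lanes = 4;
alias Wide = float[Lanes];   // one value per lane

/// Four xyzw vectors stored component-wise (x.aaaa, y.aaaa, ...).
struct Vec4SoA
{
    Wide x, y, z, w;
}

/// c = a + b for four program instances at once:
/// one array operation per component.
Vec4SoA addLanes(Vec4SoA a, Vec4SoA b)
{
    Vec4SoA c;
    c.x[] = a.x[] + b.x[];
    c.y[] = a.y[] + b.y[];
    c.z[] = a.z[] + b.z[];
    c.w[] = a.w[] + b.w[];
    return c;
}

/// Scalar source: if (v < 0) r = -v; else r = 2 * v;
/// Both sides run for every lane, then a per-lane mask
/// picks which result to keep.
Wide branchMasked(Wide v)
{
    Wide thenSide, elseSide, r;
    thenSide[] = -v[];          // "if" side, all lanes
    elseSide[] = v[] * 2.0f;    // "else" side, all lanes
    foreach (i; 0 .. Lanes)
    {
        immutable mask = v[i] < 0;              // per-lane condition
        r[i] = mask ? thenSide[i] : elseSide[i]; // masked select
    }
    return r;
}
```

A real implementation would keep the mask in a SIMD register and 
use a blend instruction instead of the per-lane select loop.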

I was also looking at some videos from Unity (game 
engine/framework) about their new "Performance by default" 
initiative.
They are building a custom subset of C# with their own compiler 
to native code. It looks like the subset is just C# with structs, 
functions, slices and annotations (no classes).
That reminded me of D :).
One thing they touched was pointer aliasing and how slices and 
custom compiler tech (that knows about the other engine systems) 
allows them to avoid aliasing and produce more optimal code.
However, the interesting part was that the compiler does similar 
things to ISPC when specific annotations are given by the 
programmer.
Video about the tech/compiler is here ( 
https://www.youtube.com/watch?v=NF6kcNS6U80&feature=youtu.be?list=PLX2vGYjWbI0S8ujCJKYT-mIZf7YCuF-Ka ).

It occurred to me that SPMD on SIMD would be a really nice 
addition to D's arsenal.
Especially since D doesn't even attempt any auto-vectorization 
(poor results and difficult to implement) and manual SIMD loops 
are quite tedious to write (even std.simd failed to materialize), 
so SPMD would be a nice alternative.
D also has some existing vector syntax and specialization, so 
there's a precedent for vector programming. This could be 
considered an extension to that.
SPMD should be easy to implement (though I'm not a compiler 
expert) since it's only a code transformation and not an 
optimization.
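
For reference, the existing vector syntax mentioned above looks 
roughly like this. It is a minimal sketch; sumArrays is just a 
name picked for the example, and whether the array operation 
actually ends up as SIMD instructions depends on the compiler and 
target.

```d
/// Element-wise sum using D's built-in array operations.
/// sumArrays is a hypothetical helper, not a library function.
void sumArrays(const(float)[] a, const(float)[] b, float[] c)
{
    // Array operations require all operands to have equal length.
    assert(a.length == b.length && b.length == c.length);
    c[] = a[] + b[];   // no explicit index, whole-array operation
}
```

This covers simple element-wise expressions, but not loops with 
branches or more complex bodies, which is where SPMD would come 
in.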

Finally, I don't think any serious systems/performance oriented 
language can ignore performance gains of that size for too long.

I had something like this in mind:

// NOTE: just removing @spmd would mean it's a normal loop, great
// for debugging
@spmd  //or @simd
foreach( int i; 0 .. 100 )
{
     c[i] = a[i] + b[i];
}

or

void doSum( float4[] a, float4[] b, float4[] c ) @spmd  //or @simd
{
     // NOTE: c[i] = a[i] + b[i], array index is implicit because
     // of @spmd, it's just some index of 0 .. a.length
     c = a + b;
}
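
To show what I mean by "only a code transformation", here is a 
hand-written guess at what the compiler could turn the @spmd 
loop into. It's just a sketch in today's D: gangSize and 
sumLowered are made-up names, the element type is assumed to be 
float, and the scalar tail loop is only one option (ISPC-style 
compilers can also handle the remainder with a mask).

```d
// Hypothetical lane count; a real compiler would pick this
// from the target's SIMD width.
enum gangSize = 8;

/// Hand-lowered version of: @spmd foreach (i; 0 .. n) c[i] = a[i] + b[i];
void sumLowered(const(float)[] a, const(float)[] b, float[] c)
{
    assert(a.length == b.length && b.length == c.length);

    size_t i = 0;
    // Full gangs: gangSize iterations of the original loop at once.
    for (; i + gangSize <= c.length; i += gangSize)
    {
        auto as = a[i .. i + gangSize];
        auto bs = b[i .. i + gangSize];
        auto cs = c[i .. i + gangSize];
        cs[] = as[] + bs[];   // stands in for the SIMD add
    }
    // Tail: leftover iterations run as a normal scalar loop.
    for (; i < c.length; ++i)
        c[i] = a[i] + b[i];
}
```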

What do you think?