How about implementing SPMD on SIMD for D?

Sun Jul 8 19:07:57 UTC 2018

On Saturday, 7 July 2018 at 13:26:10 UTC, Guillaume Piolat wrote:
> On Friday, 6 July 2018 at 23:08:27 UTC, Random D user wrote:
>> Especially, since D doesn't even attempt any 
>> auto-vectorization (poor results and difficult to implement) 
>> and manual loops are quite tedious to write (even std.simd 
>> failed to materialize), so SPMD would be nice alternative.
>
> I think you are mistaken, D code is autovectorized often when 
> using LDC.

That is good to know.
I haven't looked that much into LDC (or clang). I mostly use dmd 
for its fast edit-compile cycle, although the plan is to use LDC 
for "release"/optimized builds eventually.

Anyway, I just want to write some non-trivial loops in SIMD 
without fiddling with intrinsics or writing a higher-level 
wrapper for them.

In my experience, you can only get the real benefits out of SIMD 
if you carefully handcraft your hot loops to fully use it. 
Sprinkling some SIMD here and there with a SIMD vector type 
doesn't really seem to yield big benefits.

>
> Sometimes it's not and it's hard to know why.

Exactly.
In my experience, compilers (MSVC) often don't.

> A pragma we could have is the one in the Intel C++ Compiler 
> that says "hey this loop is safe to autovectorize".
>
>> What do you think?
>
> I think that ispc is like OpenCL on the CPU, but can't work on 
> the GPU, FPGA or other OpenCL implementation. OpenCL is so fast 
> because caching is explicit (several levels of memory are 
> exposed).

Yeah, it should be similar. The point is not to run it on the 
GPU; you can use CUDA, OpenCL, compute shaders, etc. for that.
CPU code is much easier to debug, and sometimes you're already 
doing things on the GPU, but your CPU side has more room for 
computation. And you don't have to copy your data between the GPU 
and CPU or deal with latency.
Of course, OpenCL runs on CPU too, but I think there's quite a 
bit of code required to set it up and to use it.

I guess my point was that I would like to write CPU SIMD code 
easily without intrinsics (or manually trying to trick the 
compiler into vectorizing the code). SPMD seems to solve these 
issues. It would also be a forward-looking step for D.

Ideally, just write your loop normally, debug it and add an 
annotation to get it to run fast on SIMD. Done.
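
For comparison, C compilers already offer a rough version of this 
"write it normally, then annotate" workflow through OpenMP's 
`#pragma omp simd`. Below is a minimal sketch, assuming a C99 
compiler. The `saxpy` function and its parameters are made up for 
illustration, and the pragma is only a hint: whether the loop 
actually gets vectorized still depends on the compiler and flags 
(e.g. -fopenmp-simd on GCC/clang).

```c
#include <stddef.h>

/* Hypothetical example kernel: out[i] = a * x[i] + y[i].
   `restrict` promises the compiler that the arrays do not overlap. */
void saxpy(size_t n, float a, const float *restrict x,
           const float *restrict y, float *restrict out)
{
    /* OpenMP 4.0 hint: the iterations are independent, so the loop
       is safe to vectorize. Without OpenMP SIMD support enabled the
       pragma is ignored and this stays a plain scalar loop. */
    #pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        out[i] = a * x[i] + y[i];
    }
}
```

This is not the full SPMD model that ispc provides, only a 
per-loop hint, but the loop body stays ordinary scalar code that 
can be debugged as usual with the pragma removed or ignored.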