DCompute: First kernels run successfully

Nicholas Wilson via Digitalmars-d-announce digitalmars-d-announce at puremagic.com
Mon Sep 11 05:23:16 PDT 2017


I'm pleased to announce that I have run the first dcompute kernel 
and it was a success!

There is still a fair bit of polish needed on the driver to make 
the API sane and more complete, not to mention more similar to 
the (untested) OpenCL driver API. But it works!
(Contributions are, of course, very welcome.)

The kernel:
```
@compute(CompileFor.deviceOnly)
module dcompute.tests.dummykernels;

import ldc.dcompute;
import dcompute.std.index;

@kernel void saxpy(GlobalPointer!(float) res,
                   float alpha, GlobalPointer!(float) x,
                   GlobalPointer!(float) y,
                   size_t N)
{
    auto i = GlobalIndex.x;   // global index of this work-item/thread
    if (i >= N) return;       // guard against excess threads
    res[i] = alpha*x[i] + y[i];
}
```
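Each work-item (thread) handles a single element: GlobalIndex.x gives the 
global index along x, and the `i >= N` guard stops any excess threads from 
writing past the end of the arrays. For reference, the same operation as a 
plain serial loop on the CPU looks like this (`saxpyCpu` is just an 
illustrative name, not part of dcompute):
```
// Serial CPU reference: res = alpha * x + y,
// one loop iteration instead of one GPU thread per element.
void saxpyCpu(float[] res, float alpha, const(float)[] x, const(float)[] y)
{
    foreach (i; 0 .. res.length)
        res[i] = alpha * x[i] + y[i];
}
```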

The host code:
```
import dcompute.driver.cuda;
import dcompute.tests.dummykernels : saxpy;

import std.experimental.allocator : theAllocator;
import std.exception : enforce;
import std.stdio : writeln;

Platform.initialise();

auto devs   = Platform.getDevices(theAllocator);
auto ctx    = Context(devs[0]); scope(exit) ctx.detach();

// Change the file to match your GPU.
Program.globalProgram =
    Program.fromFile("./.dub/obj/kernels_cuda210_64.ptx");
auto q = Queue(false);

enum size_t N = 128;
float alpha = 5.0;
float[N] res, x, y;
foreach (i; 0 .. N)
{
    x[i] = N - i;
    y[i] = i * i;
}
Buffer!(float) b_res, b_x, b_y;
b_res      =  Buffer!(float)(res[]); scope(exit) b_res.release();
b_x        =  Buffer!(float)(x[]);   scope(exit) b_x.release();
b_y        =  Buffer!(float)(y[]);   scope(exit) b_y.release();

b_x.copy!(Copy.hostToDevice); // not quite sold on this interface yet.
b_y.copy!(Copy.hostToDevice);

q.enqueue!(saxpy)       // <-- the main magic happens here
    ([N,1,1], [1,1,1])  // the grid
    (b_res, alpha, b_x, b_y, N); // the kernel arguments

b_res.copy!(Copy.deviceToHost);
foreach (i; 0 .. N)
    enforce(res[i] == alpha * x[i] + y[i]);
writeln(res[]); // [640, 636, ... 16134]
```

Simple as that!
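
A quick sanity check on that output: with alpha = 5, x[i] = N - i, and 
y[i] = i * i, we expect res[i] = 5 * (128 - i) + i * i, so res[0] = 640, 
res[1] = 636 and res[127] = 5 + 127 * 127 = 16134, which is exactly what 
gets printed. The launch above uses 128 single-thread blocks purely for 
simplicity; assuming the first tuple is the number of blocks and the 
second the threads per block, grouping the work into larger blocks would 
look roughly like this:
```
// Hypothetical alternative launch: 2 blocks of 64 threads rather than
// 128 blocks of 1 thread; the kernel's `i >= N` guard still applies.
enum groupSize = 64;
q.enqueue!(saxpy)
    ([(N + groupSize - 1) / groupSize, 1, 1], [groupSize, 1, 1])
    (b_res, alpha, b_x, b_y, N);
```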

Dcompute, as always, is at https://github.com/libmir/dcompute and 
on dub.

To run the dcompute CUDA test successfully you will need a very 
recent LDC (less than two days old) with the NVPTX backend* enabled, 
along with a CUDA environment and an Nvidia GPU.

*Or wait for the LDC 1.4 release, coming real soon(™).
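
For the curious, device-code generation is driven by LDC's 
`-mdcompute-targets` switch; a build boils down to something along these 
lines (the target version and source paths below are illustrative, adjust 
them to your GPU and project):
```
# Emit PTX for compute capability 2.1 as kernels_cuda210_64.ptx
# (the file the host code loads above) while compiling the host code.
ldc2 -mdcompute-targets=cuda-210 source/app.d source/dcompute/tests/dummykernels.d
```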

Thanks to the LDC folks for putting up with me ;)

Have fun GPU programming,
Nic

