GPGPUs

Atash nope at nope.nope
Fri Aug 16 23:09:51 PDT 2013


On Saturday, 17 August 2013 at 00:53:39 UTC, luminousone wrote:
> You can't mix CPU and GPU code; they must be separate.

H'okay, let's be clear here. When you say 'mix CPU and GPU code', 
you mean you can't mix them physically in the compiled executable 
for all currently extant cases. They aren't the same. I agree 
with that.

That said, this doesn't preclude having CUDA-like behavior where 
small functions could be written that don't violate the 
constraints of GPU code and simultaneously have semantics that 
could be executed on the CPU, and where such small functions are 
then allowed to be called from both CPU and GPU code.
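
Something like this, as a minimal sketch (the attributes are my 
stand-in for 'obeys GPU constraints'; they don't actually target 
a GPU):


float madd(float a, float x, float y) @nogc nothrow pure
{
   // no allocation, no exceptions, no global state: nothing here
   // that couldn't run on either a CPU or a GPU lane
   return a * x + y;
}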

> However, this still has the problem of the CPU having to 
> generate CPU code from the contents of gpu{} code blocks, as 
> the GPU is unable to allocate memory, so for example,
>
> gpu{
>     auto resultGPU = dot(c, cGPU);
> }
>
> likely either won't work, or generates an array allocation in 
> CPU code before the gpu block is otherwise run.

I'm fine with an array allocation. I'd 'prolly have to do it 
anyway.
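
I'd picture the lowering going something like this (gpuAlloc is 
a name I just made up; the gpu{} syntax is from your proposal):


// CPU side: the result buffer gets allocated up front...
auto resultGPU = gpuAlloc!float(cGPU.length);
// ...so the gpu{} block itself only ever computes.
gpu{
   resultGPU[] = dot(c, cGPU);
}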

> Also, how does that dot product function know the correct 
> index range to run on? Are we assuming it knows based on the 
> length of a? While the syntax
>
> c[] = a[] * b[];
>
> is safe for this sort of call, a function is less safe to do 
> this with: with function calls, the range needs to be told to 
> the function, and you would call this function without the 
> gpu{} block as the function itself is marked.
>
> auto resultGPU =
>     dot$(0 .. returnLesser(cGPU.length, dGPU.length))(cGPU, dGPU);

'Dat's a point.

> Remember, with GPUs you don't send instructions, you send 
> whole programs, and the whole program must finish before you 
> can move on to the next CPU instruction.

I disagree with the assumption that the CPU must wait for the GPU 
while the GPU is executing. Blocking by default might be helpful 
for sequencing GPU global memory with CPU operations, but it's 
not a *necessary* behavior.

Well, I disagree with the assumption assuming said assumption is 
being made and I'm not just misreading that bit. :-P
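
To sketch the non-blocking model in plain D, with 
std.parallelism's task standing in for a real GPU command queue 
(runKernel is just a placeholder CPU function):


import std.parallelism : task, taskPool;

// Placeholder for a real GPU dispatch; here it's a CPU function.
float[] runKernel(float[] buf)
{
   foreach (ref x; buf) x += 4;
   return buf;
}

void main()
{
   auto buffer = new float[](1024);
   buffer[] = 1;

   auto fut = task!runKernel(buffer); // "enqueue" the kernel...
   taskPool.put(fut);                 // ...CPU keeps going

   // ...unrelated CPU work can overlap with the 'GPU' work here...

   auto result = fut.yieldForce;      // block only at the sync point
}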

=== Another thing...

I'm with luminousone's suggestion for some manner of function 
attribute, to the tune of several metric tonnes of chimes. Wind 
chimes. I'm supporting this suggestion with at least a metric 
tonne of wind chimes.

I'd prefer this (and some small number of helpers) to 
straight-up dumping a new keyword and block type into the 
language. I really don't think D *needs* this at any level lower 
than a library-based solution, because it already has the tools 
to make one ridiculously more convenient than C/C++ (not 
necessarily as convenient as CUDA with its totally separate nvcc 
compiler, but by a huge amount).

ex.


@kernel auto myFun(BufferT)(BufferT glbmem)
{
   // Brings in the kernel keywords and whatnot depending on
   // __FUNCTION__ (because mixins eval where they're mixed in).
   mixin KernelDefs;
   // ^ And that's just about all the syntactic noise; the rest
   //   uses mixed-in keywords and the glbmem object to define
   //   several expressions that effectively record the operations
   //   to be performed into the return type.

   // Assignment into global memory recovers the expression type
   // in the glbmem.
   glbmem[glbid] += 4;

   // This assigns the *expression* glbmem[glbid] to val.
   auto val = glbmem[glbid];

   // Ignoring that this has a data race, this exemplifies
   // recapturing the expression 'val' (glbmem[glbid]) in
   // glbmem[glbid+1].
   glbmem[glbid+1] = val;

   return glbmem; ///< I lied about the syntactic noise. This is
                  ///  the last bit.
}


Now if you want to, you can at runtime create an OpenCL-code 
string (for example) by passing a heavily metaprogrammed type in 
as BufferT. The call ends up looking like this:


auto promisedFutureResult = Gpu.call!myFun(buffer);


The kernel compilation (assuming OpenCL) is memoized, and the 
promisedFutureResult is some asynchronous object that implements 
concurrent programming's future (or something to that extent). 
For convenience, let's say that it blocks on any read other than 
some special poll/checking mechanism.
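
In usage, something like this ('ready' and 'result' are invented 
member names for the poll and the blocking read):


auto promisedFutureResult = Gpu.call!myFun(buffer);
while (!promisedFutureResult.ready) // special polling mechanism
   doOtherCpuWork();                // placeholder for overlap work
auto data = promisedFutureResult.result; // any other read blocks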

The constraints imposed on the kernel functions are general 
enough that the same code can even be executed on the CPU: the 
launching call ( Gpu.call!myFun(buffer) ) can, instead of using 
an expression-buffer, just pass a normal array in and have the 
proper result pop out, given some interaction between the 
identifiers mixed in by KernelDefs and the launching caller (ex. 
using a loop; see the sketch below).
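
A minimal sketch of that loop-based CPU launcher, with the 
kernel body taken as an alias parameter (all names invented):


// Emulate a 1-D launch: one loop iteration per work-item, with
// the loop index standing in for the mixed-in glbid.
void cpuLaunch(alias kern, T)(T[] glbmem)
{
   foreach (glbid; 0 .. glbmem.length)
      kern(glbmem, glbid);
}

void main()
{
   float[] buf = [1, 2, 3];
   // same body as the first line of myFun above
   cpuLaunch!((mem, id) { mem[id] += 4; })(buf);
   assert(buf == [5f, 6f, 7f]);
}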

As an alternative to returning the captured expressions, the 
argument glbmem could have been passed by ref, and the same sort 
of expression capturing could occur. Heck, more arguments could've 
been passed, too; this doesn't require there to be one single 
argument representing global memory.

With CTFE, this method *I think* can also generate the code at 
compile time, given the proper kind of expression-type-recording 
BufferT.
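
To make the expression-recording idea concrete, here's a toy 
BufferT whose index operators append OpenCL-flavored source text 
instead of touching memory (wildly simplified; the index is just 
a string where the real thing would use the mixed-in glbid):


import std.conv : to;

struct RecBuffer
{
   string code; // the kernel body accumulated so far

   // An "element" is just the text of the expression it denotes.
   static struct Elem { string expr; }

   Elem opIndex(string idx)  // auto val = b["gid"];
   {
      return Elem("glbmem[" ~ idx ~ "]");
   }

   void opIndexOpAssign(string op)(int v, string idx) // b["gid"] += 4;
   {
      code ~= "glbmem[" ~ idx ~ "] " ~ op ~ "= " ~ v.to!string ~ ";\n";
   }

   void opIndexAssign(Elem rhs, string idx) // b["gid+1"] = val;
   {
      code ~= "glbmem[" ~ idx ~ "] = " ~ rhs.expr ~ ";\n";
   }
}

// Mirrors the kernel body above; runs fine under CTFE:
enum src = {
   RecBuffer b;
   b["gid"] += 4;
   auto val = b["gid"];
   b["gid+1"] = val;
   return b.code;
}();
static assert(src == "glbmem[gid] += 4;\nglbmem[gid+1] = glbmem[gid];\n");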

Again, though, all this requires a significant amount of 
metaprogramming, heavy abuse of auto, and... did I mention a 
significant amount of metaprogramming? It's roughly the same 
method I used to embed OpenCL code in a C++ project of mine 
without writing a single line of OpenCL code, however, so I 
*know* it's doable, likely even more so in D.

