GPGPUs

Atash nope at nope.nope
Fri Aug 16 22:51:05 PDT 2013


On Saturday, 17 August 2013 at 00:53:39 UTC, luminousone wrote:
> You can't mix cpu and gpu code, they must be separate.

H'okay, let's be clear here. When you say 'mix CPU and GPU code', 
you mean you can't mix them physically in the compiled executable 
for all currently extant cases. They aren't the same; I agree 
with that. That said, this doesn't preclude CUDA-like behavior 
where small functions can be written that don't violate the 
constraints of GPU code and simultaneously have semantics that 
can be executed on the CPU, and where such small functions are 
then allowed to be called from both CPU and GPU code.

> However this still has problems of the cpu having to generate 
> CPU code from the contents of gpu{} code blocks, as the GPU is 
> unable to allocate memory, so for example,
>
> gpu{
>     auto resultGPU = dot(c, cGPU);
> }
>
> likely either won't work, or generates an array allocation in 
> cpu code before the gpu block is otherwise run.

I wouldn't be so negative with the 'won't work' bit, 'cuz frankly 
the 'or' you wrote there is semantically like what OpenCL and 
CUDA do anyway.

> Also how does that dot product function know the correct index 
> range to run on? Are we assuming it knows based on the length 
> of a? While the syntax,
>
> c[] = a[] * b[];
>
> is safe for this sort of call, a function is less safe to do 
> this with; with function calls the range needs to be told to 
> the function, and you would call this function without the 
> gpu{} block as the function itself is marked.
>
> auto resultGPU = dot$(0 .. 
> returnLesser(cGPU.length,dGPU.length))(cGPU, dGPU);

I think it was mentioned earlier that there should be, much like 
in OpenCL or CUDA, builtins or otherwise available symbols for 
getting the global identifier of each work-item, the work-group 
size, global size, etc.
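
To make that concrete, here is a minimal sketch of what such a 
mixin could expose. All of these names (glbid, lclid, glbsize, 
grpsize) are assumptions of mine, mirroring OpenCL's 
get_global_id and friends; none of this exists yet:


mixin template KernelDefs()
{
   // In generated GPU code these would lower to the target's intrinsics
   // (e.g. get_global_id(0) in OpenCL C); in a CPU fallback, the launcher
   // would bind them to plain loop indices instead.
   size_t glbid;     // global work-item id
   size_t lclid;     // id within the work-group
   size_t glbsize;   // total number of work-items
   size_t grpsize;   // work-group size
}


With something like that mixed in, a kernel such as the dot 
product doesn't need its index range passed per call; each 
work-item reads glbid, and the launch parameters carry the range.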

> Remember with gpu's you don't send instructions, you send whole 
> programs, and the whole program must finish before you can move 
> onto the next cpu instruction.

I disagree with the assumption that the CPU must wait for the GPU 
while the GPU is executing. Perhaps by default that behavior 
could be helpful for sequencing global memory in the GPU with CPU 
operations, but it's not a necessary behavior (see OpenCL and 
its, in my opinion, really nice queuing mechanism).
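
For illustration, a non-blocking launch under an OpenCL-like 
queue could look roughly like this in D. GpuQueue, enqueue, and 
the event objects are hypothetical names standing in for OpenCL's 
command queues and events:


// Hypothetical sketch; none of these types exist yet.
auto q  = GpuQueue();
auto e1 = q.enqueue!kernelA(bufA);      // returns immediately
auto e2 = q.enqueue!kernelB(bufB, e1);  // runs only after e1 completes
doOtherCpuWork();                       // CPU proceeds concurrently
e2.wait();                              // block only when the result is needed


The point being that ordering is expressed through event 
dependencies, not by stalling the CPU at every kernel launch.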

=== Another thing...

I'm with luminousone's suggestion for some manner of function 
attribute, to the tune of several metric tonnes of chimes. Wind 
chimes. I'm supporting this suggestion with at least a metric 
tonne of wind chimes.

*This* (and some small number of helpers), rather than 
straight-up dumping a new keyword and block type into the 
language. I really don't think D *needs* to have this any lower 
level than a library-based solution, because it already has the 
tools to make it ridiculously more convenient than C/C++ (not 
necessarily as much as CUDA's totally separate compiler, nvcc, 
does, but a huge amount).

ex.


@kernel auto myFun(BufferT)(BufferT glbmem)
{
   // Brings in the kernel keywords and whatnot, depending on __FUNCTION__
   // (because mixins evaluate where they're mixed in).
   mixin KernelDefs;
   // ^ And that's just about all the syntactic noise; the rest uses mixed-in
   //   keywords and the glbmem object to define several expressions that
   //   effectively record the operations to be performed into the return type.

   // Assignment into global memory recovers the expression type in the glbmem.
   glbmem[glbid] += 4;

   // This assigns the *expression* glbmem[glbid] to val.
   auto val = glbmem[glbid];

   // Ignoring that this has a data race, this exemplifies recapturing the
   // expression 'val' (glbmem[glbid]) in glbmem[glbid+1].
   glbmem[glbid+1] = val;

   return glbmem; ///< I lied about the syntactic noise. This is the last bit.
}


Now if you want to, you can at runtime create an OpenCL-code 
string (for example) by passing a heavily metaprogrammed type in 
as BufferT. The call ends up looking like this:


auto promisedFutureResult = Gpu.call!myFun(buffer);


The kernel compilation (assuming OpenCL) is memoized, and 
promisedFutureResult is some asynchronous object that implements 
concurrent programming's notion of a future (or something to that 
effect). For convenience, let's say that it blocks on any read 
other than some special poll/checking mechanism.
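
As a sketch of that future-like handle (all names hypothetical):


struct GpuFuture(T)
{
   private bool delegate() _poll;  // non-blocking completion check
   private T delegate() _wait;     // blocks until the kernel has finished

   @property bool ready() { return _poll(); }
   T get() { return _wait(); }     // any real read goes through here
   alias get this;                 // so implicit reads block, as promised
}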

The constraints imposed on the kernel functions are general 
enough to even execute the code on the CPU, as the launching call 
( Gpu.call!myFun(buffer) ) can, instead of using an 
expression-buffer, just pass a normal array in and have the 
proper result pop out, given some interaction between the 
identifiers mixed in by KernelDefs and the launching caller (ex. 
using a loop).
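
A CPU launcher in that spirit might be as dumb as the following. 
Again, every name here is hypothetical, and a real version would 
want parallelism and the rest of the work-group identifiers wired 
up:


struct Cpu
{
   static auto call(alias kernel, T)(T[] buffer)
   {
      // One "work-item" per index: the identifiers that KernelDefs mixes
      // in (glbid etc.) would be bound to i on each iteration.
      foreach (i; 0 .. buffer.length)
         kernel(buffer, i);
      return buffer;  // result is immediately available; no future needed
   }
}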

With CTFE, this method *I think* can also generate the code at 
compile time given the proper kind of 
expression-type-recording-BufferT.

Again, though, this requires a significant amount of 
metaprogramming, heavy abuse of auto, and... did I mention a 
significant amount of metaprogramming? It's roughly the same 
method I used to embed OpenCL code in a C++ project of mine 
without writing a single line of OpenCL code, however, so I 
*know* it's doable, likely even more so, in D.


More information about the Digitalmars-d mailing list