GPGPUs
Atash
nope at nope.nope
Fri Aug 16 23:09:51 PDT 2013
On Saturday, 17 August 2013 at 00:53:39 UTC, luminousone wrote:
> You can't mix cpu and gpu code, they must be separate.
H'okay, let's be clear here. When you say 'mix CPU and GPU code',
you mean you can't mix them physically in the compiled executable
for all currently extant cases. They aren't the same. I agree
with that.
That said, this doesn't preclude having CUDA-like behavior where
small functions could be written that don't violate the
constraints of GPU code and simultaneously have semantics that
could be executed on the CPU, and where such small functions are
then allowed to be called from both CPU and GPU code.
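As a minimal sketch of what I mean (the names here are made up by
me, nothing real): a function that is pure, allocation-free, and
side-effect-free violates no GPU constraint, so nothing stops a
compiler from emitting both a CPU and a GPU version of it.

// Hypothetical: @kernel marks the function as lowerable to GPU code,
// but nothing in the body (no allocation, no globals, no I/O) prevents
// the compiler from also emitting an ordinary CPU version.
@kernel pure @nogc float saxpy(float a, float x, float y)
{
    return a * x + y; // callable from both CPU and GPU code
}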
> However, this still has the problem of the CPU having to
> generate CPU code from the contents of gpu{} code blocks, as
> the GPU is unable to allocate memory. So, for example,
>
> gpu {
>     auto resultGPU = dot(c, cGPU);
> }
>
> likely either won't work, or generates an array allocation in
> CPU code before the gpu block is otherwise run.
I'm fine with an array allocation. I'd 'prolly have to do it
anyway.
> Also, how does that dot product function know the correct index
> range to run on? Are we assuming it knows based on the length
> of a? While the syntax,
>
> c[] = a[] * b[];
>
> is safe for this sort of call, a function is less safe to do
> this with; with function calls, the range needs to be told to
> the function, and you would call this function without the
> gpu{} block, as the function itself is marked:
>
> auto resultGPU = dot$(0 .. returnLesser(cGPU.length, dGPU.length))(cGPU, dGPU);
'Dat's a point.
> Remember, with GPUs you don't send instructions, you send whole
> programs, and the whole program must finish before you can move
> on to the next CPU instruction.
I disagree with the assumption that the CPU must wait for the GPU
while the GPU is executing. Perhaps that behavior would be a
helpful default for sequencing GPU global memory with CPU
operations, but it's not a *necessary* behavior.
Well, I disagree with that assumption, assuming said assumption is
actually being made and I'm not just misreading that bit. :-P
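To illustrate what I mean, here's a sketch using std.parallelism's
futures, where runKernelOnGpu and doUnrelatedCpuWork are stand-ins
I just made up:

import std.parallelism : task;

// The CPU enqueues the kernel and keeps working; it only blocks
// when it actually needs the result.
auto fut = task!runKernelOnGpu(buffer); // runKernelOnGpu is hypothetical
fut.executeInNewThread();
doUnrelatedCpuWork();                   // CPU work overlaps GPU execution
auto result = fut.yieldForce();         // synchronize only here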
=== Another thing...
I'm with luminousone's suggestion for some manner of function
attribute, to the tune of several metric tonnes of chimes. Wind
chimes. I'm supporting this suggestion with at least a metric
tonne of wind chimes.
I'd prefer this (and some small number of helpers) rather than
straight-up dumping a new keyword and block type into the
language. I really don't think D *needs* to have this any lower
level than a library-based solution, because it already has the
tools to make it ridiculously more convenient than C/C++ (not
necessarily as much as CUDA's totally separate compiler nvcc
does, but a huge amount).
ex.
@kernel auto myFun(BufferT)(BufferT glbmem)
{
    // Brings in the kernel keywords and whatnot, depending on __FUNCTION__
    // (because mixins eval where they're mixed in).
    mixin KernelDefs;
    // ^ And that's just about all the syntactic noise; the rest uses mixed-in
    // keywords and the glbmem object to define several expressions that
    // effectively record the operations to be performed into the return type.

    // Assignment into global memory recovers the expression type in the glbmem.
    glbmem[glbid] += 4;

    // This assigns the *expression* glbmem[glbid] to val.
    auto val = glbmem[glbid];

    // Ignoring that this has a data race, this exemplifies recapturing the
    // expression 'val' (glbmem[glbid]) in glbmem[glbid+1].
    glbmem[glbid + 1] = val;

    return glbmem; ///< I lied about the syntactic noise. This is the last bit.
}
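For concreteness, here's a stripped-down sketch of what such an
expression-recording buffer could look like; RecordingBuffer and
Expr are my invention, not anything that exists:

// Indexing never touches memory; it builds up OpenCL source text.
// glbid would be mixed in by KernelDefs as Expr("get_global_id(0)").
struct Expr
{
    string text;
    Expr opBinary(string op)(int rhs) const
    {
        import std.conv : to;
        return Expr("(" ~ text ~ " " ~ op ~ " " ~ rhs.to!string ~ ")");
    }
}

struct RecordingBuffer
{
    string source; // the kernel body accumulated so far

    Expr opIndex(Expr idx)
    {
        return Expr("glbmem[" ~ idx.text ~ "]");
    }

    void opIndexOpAssign(string op)(int rhs, Expr idx)
    {
        import std.conv : to;
        source ~= "glbmem[" ~ idx.text ~ "] " ~ op ~ "= " ~ rhs.to!string ~ ";\n";
    }

    void opIndexAssign(Expr rhs, Expr idx)
    {
        source ~= "glbmem[" ~ idx.text ~ "] = " ~ rhs.text ~ ";\n";
    }
}

Running myFun over a RecordingBuffer leaves the OpenCL text for all
three statements sitting in .source.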
Now, if you want to, you can create an OpenCL code string at
runtime (for example) by passing a heavily metaprogrammed type in
as BufferT. The call ends up looking like this:
auto promisedFutureResult = Gpu.call!myFun(buffer);
The kernel compilation (assuming OpenCL) is memoized, and the
promisedFutureResult is some asynchronous object that implements
concurrent programming's notion of a future (or something to that
effect). For convenience, let's say that it blocks on any read
other than some special poll/checking mechanism.
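Usage might look something like this, where .ready and .get are
placeholder names for the poll and the blocking read:

auto promisedFutureResult = Gpu.call!myFun(buffer); // enqueue, return at once
while (!promisedFutureResult.ready) // hypothetical non-blocking poll
    doOtherCpuWork();
auto data = promisedFutureResult.get; // a real read blocks until the kernel is done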
The constraints imposed on the kernel functions are general
enough that the same code can even be executed on the CPU, as the
launching call ( Gpu.call!myFun(buffer) ) can, instead of using an
expression-buffer, just pass a normal array in and have the
proper result pop out, given some interaction between the
identifiers mixed in by KernelDefs and the launching caller (e.g.
using a loop).
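Something along these lines, with invented names; the point is
just that glbid degenerates into a loop index:

// CPU fallback: run the same kernel body once per element of a plain
// array. CpuBuffer and its glbid plumbing are hypothetical; KernelDefs
// would define glbid as the buffer's current index.
auto cpuCall(alias kernel, T)(T[] data)
{
    foreach (i; 0 .. data.length)
        kernel(CpuBuffer!T(data, i));
    return data;
}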
As an alternative to returning the captured expressions, the
argument glbmem could have been passed ref, and the same sort of
expression capturing could occur. Heck, more arguments could've
been passed, too; this doesn't require there to be one single
argument representing global memory.
With CTFE, this method can also, *I think*, generate the code at
compile time, given the proper kind of expression-recording
BufferT.
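In sketch form, reusing the RecordingBuffer from above and taking
.source as a placeholder for wherever the recorded text ends up:

// myFun is an ordinary template function, so handing it a recording
// buffer in a compile-time context runs it through CTFE.
enum kernelSource = myFun(RecordingBuffer.init).source;
pragma(msg, kernelSource); // generated OpenCL, printed during compilation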
Again, though, all this requires a significant amount of
metaprogramming, heavy abuse of auto, and... did I mention a
significant amount of metaprogramming? It's roughly the same
method I used to embed OpenCL code in a C++ project of mine
without writing a single line of OpenCL code, though, so I
*know* it's doable, likely even more so, in D.