GPGPUs

luminousone rd.hunt at gmail.com
Sat Aug 17 01:04:55 PDT 2013


On Saturday, 17 August 2013 at 06:09:53 UTC, Atash wrote:
> On Saturday, 17 August 2013 at 00:53:39 UTC, luminousone wrote:
>> You can't mix CPU and GPU code; they must be separate.
>
> H'okay, let's be clear here. When you say 'mix CPU and GPU 
> code', you mean you can't mix them physically in the compiled 
> executable for all currently extant cases. They aren't the 
> same. I agree with that.
>
> That said, this doesn't preclude having CUDA-like behavior 
> where small functions could be written that don't violate the 
> constraints of GPU code and simultaneously have semantics that 
> could be executed on the CPU, and where such small functions 
> are then allowed to be called from both CPU and GPU code.
>
>> However, this still has the problem of the CPU having to 
>> generate CPU code from the contents of gpu{} code blocks, as 
>> the GPU is unable to allocate memory. So, for example,
>>
>> gpu{
>>    auto resultGPU = dot(c, cGPU);
>> }
>>
>> likely either won't work, or generates an array allocation in 
>> CPU code before the gpu block is otherwise run.
>
> I'm fine with an array allocation. I'd 'prolly have to do it 
> anyway.
>
>> Also, how does that dot product function know the correct 
>> index range to run on? Are we assuming it knows based on the 
>> length of a? While the syntax
>>
>> c[] = a[] * b[];
>>
>> is safe for this sort of call, a function is less safe to do 
>> this with: with function calls, the range needs to be told to 
>> the function, and you would call such a function without the 
>> gpu{} block, as the function itself is marked.
>>
>> auto resultGPU = dot$(0 .. 
>> returnLesser(cGPU.length,dGPU.length))(cGPU, dGPU);
>
> 'Dat's a point.
>
>> Remember, with GPUs you don't send instructions, you send 
>> whole programs, and the whole program must finish before you 
>> can move on to the next CPU instruction.
>
> I disagree with the assumption that the CPU must wait for the 
> GPU while the GPU is executing. Perhaps by default the behavior 
> could be helpful for sequencing global memory in the GPU with 
> CPU operations, but it's not a *necessary* behavior.
>
> Well, I disagree with the assumption assuming said assumption 
> is being made and I'm not just misreading that bit. :-P
>
> === Another thing...
>
> I'm with luminousone's suggestion for some manner of function 
> attribute, to the tune of several metric tonnes of chimes. Wind 
> chimes. I'm supporting this suggestion with at least a metric 
> tonne of wind chimes.
>
> I'd prefer this (and some small number of helpers) rather than 
> straight-up dumping a new keyword and block type into the 
> language. I really don't think D *needs* to have this any lower 
> level than a library-based solution, because it already has the 
> tools to make it ridiculously more convenient than C/C++ (not 
> necessarily as convenient as CUDA's totally separate compiler, 
> nvcc, makes it, but a huge amount).
>
> ex.
>
>
> @kernel auto myFun(BufferT)(BufferT glbmem)
> {
>   // Brings in the kernel keywords and whatnot depending on
>   // __FUNCTION__ (because mixins eval where they're mixed in).
>   mixin KernelDefs;
>   // ^ And that's just about all the syntactic noise; the rest
>   //   uses mixed-in keywords and the glbmem object to define
>   //   several expressions that effectively record the
>   //   operations to be performed into the return type.
>
>   // Assignment into global memory recovers the expression type
>   // in the glbmem.
>   glbmem[glbid] += 4;
>
>   // This assigns the *expression* glbmem[glbid] to val.
>   auto val = glbmem[glbid];
>
>   // Ignoring that this has a data race, this exemplifies
>   // recapturing the expression 'val' (glbmem[glbid]) in
>   // glbmem[glbid+1].
>   glbmem[glbid+1] = val;
>
>   return glbmem; ///< I lied about the syntactic noise. This is
>                  ///< the last bit.
> }
>
>
> Now if you want to, you can at runtime create an OpenCL-code 
> string (for example) by passing a heavily metaprogrammed type 
> in as BufferT. The call ends up looking like this:
>
>
> auto promisedFutureResult = Gpu.call!myFun(buffer);
>
>
> The kernel compilation (assuming OpenCL) is memoized, and the 
> promisedFutureResult is some asynchronous object that 
> implements concurrent programming's future (or something to 
> that extent). For convenience, let's say that it blocks on any 
> read other than some special poll/checking mechanism.
>
> The constraints imposed on the kernel functions are 
> generalizable enough to even execute the code on the CPU, as 
> the launching call ( Gpu.call!myFun(buffer) ) can, instead of 
> using an expression-buffer, just pass a normal array in and 
> have the proper result pop out, given some interaction between 
> the identifiers mixed in by KernelDefs and the launching caller 
> (ex. using a loop).
>
> As an alternative to returning the captured expressions, the 
> argument glbmem could have been passed by ref, and the same 
> sort of expression capturing could occur. Heck, more arguments 
> could've been passed, too; this doesn't require there to be one 
> single argument representing global memory.
>
> With CTFE, this method *I think* can also generate the code at 
> compile time given the proper kind of 
> expression-type-recording-BufferT.
>
> Again, though, all this requires a significant amount of 
> metaprogramming, heavy abuse of auto, and... did I mention a 
> significant amount of metaprogramming? It's roughly the same 
> method I used to embed OpenCL code in a C++ project of mine 
> without writing a single line of OpenCL code, however, so I 
> *know* it's doable, likely even more so, in D.

Often, when programmers are first introduced to GPU programming, 
they imagine GPU instructions as being part of the instruction 
stream the CPU receives, completely missing the point of what 
makes the entire scheme so useful.

The GPU might better be imagined as a wholly separate computer 
that happens to be networked via the system bus. Every 
interaction between the CPU and the GPU has to travel across 
this expensive, comparatively high-latency divide, so the goal 
is a design that makes it easy to avoid interaction between the 
two separate entities as much as possible, while still getting 
the maximum control and performance from them.
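
As a rough sketch of that cost model (copyToDevice, copyToHost, 
and launch are hypothetical helpers here, not an existing API), 
the aim is a fixed, small number of round trips regardless of 
how much data is processed:

auto dA = copyToDevice(a);          // one bulk transfer across the bus
auto dB = copyToDevice(b);          // one bulk transfer
auto dC = launch!mulKernel(dA, dB); // one asynchronous kernel launch
float[] c = copyToHost(dC);         // one transfer back, the only sync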

OpenCL may have picked the term __kernel based on the idea that 
the GPU program in effect represents the device's operating 
system for the duration of that function call.

Single-statement operations on the GPU, in this vein, are a 
horridly bad idea. So ...

gpu{
    c[] = a[] * b[];
}

seems like very bad design to me.

In fact, being able to have arbitrary gpu {} code blocks seems 
like a bad idea in this vein. Each line in such a block would 
very likely end up being a separate GPU __kernel function, 
creating excessive amounts of CPU/GPU interaction, as each line 
may have a different range.
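
To illustrate (gpu{} here is the hypothetical block syntax from 
this thread, not existing D), a block such as

gpu{
    c[] = a[] * b[];  // likely its own __kernel plus a host-side launch
    e[] = c[] + d[];  // a second __kernel and launch, possibly over a
                      // different range, with a sync in between
}

would plausibly lower to two separate __kernel programs and two 
round trips across the bus, rather than one fused kernel.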

The foreach loop actually fits the model of microthreading very 
nicely. It has a clearly defined range; there is no dependency 
on the order in which the code operates on any arrays used in 
the loop; you have an implicit index that is unique for each 
value in the range; and you can't change the size of the range 
mid-execution (at least, I haven't seen anyone do it so far).
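
As a rough sketch of that fit (the @kernel attribute and 
Gpu.call are the hypothetical pieces suggested earlier in this 
thread, not existing D), a kernel could simply be a foreach 
whose body is free of cross-iteration dependencies:

@kernel void mul(float[] a, float[] b, float[] c)
{
    // The foreach supplies everything the device needs up front:
    // a fixed range, a unique index i per work-item, and no
    // ordering dependency between iterations.
    foreach (i; 0 .. c.length)
        c[i] = a[i] * b[i];
}

// One launch, one range, one round trip.
auto result = Gpu.call!mul(a, b, c);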


