GPGPUs
luminousone
rd.hunt at gmail.com
Fri Aug 16 12:55:54 PDT 2013
> The core (!) point here is that processor chips are rapidly
> becoming a
> collection of heterogeneous cores. Any programming language
> that assumes
> a single CPU or a collection of homogeneous CPUs has built-in
> obsolescence.
>
> So the question I am interested in is whether D is the language
> that can
> allow me to express in a single codebase a program in which
> parts will
> be executed on one or more GPGPUs and parts on multiple CPUs. D
> has
> support for the latter, std.parallelism and std.concurrency.
>
> I guess my question is whether people are interested in
> std.gpgpu (or
> some more sane name).
CUDA works as a preprocessor pass that generates C files from .cu
source files.
In effect, to create a sensible environment for microthreaded
programming, they extend the language.
A basic CUDA kernel looks something like this:

__global__ void add(float *a, float *b, float *c)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

// launch one block of ten threads
add<<<1, 10>>>(ptrA, ptrB, ptrC);
There are built-in variables to handle the index location,
threadIdx.x in the above example; these are filled in by the thread
scheduler on the video card/APU device.
Generally, calls to this setup have very high latency, so using it
for a small handful of items as in the above example makes no sense.
That example would end up using a single execution cluster and leave
you prey to the latency of the PCIe bus, the execution time, and the
latency costs of the video memory.
It only becomes effective when you are working with large data sets
that can take advantage of a massive number of threads, where the
latency problems become secondary to the sheer amount of calculation
being done.
As far as D goes, we really only have one built-in
microthreading-capable language construct: foreach.
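For comparison, CPU-side microthreading with today's std.parallelism
looks roughly like this (a minimal sketch; the array sizes and values
are just illustrative):

import std.parallelism : parallel;
import std.stdio : writeln;

void main()
{
    auto a = new float[1_000_000];
    auto b = new float[1_000_000];
    auto c = new float[1_000_000];
    a[] = 1.0f;
    b[] = 2.0f;

    // Each chunk of iterations runs on a different CPU worker thread;
    // the loop index plays the role threadIdx.x plays on the GPU.
    foreach (i, ref elem; parallel(c))
        elem = a[i] + b[i];

    writeln(c[0]); // 3
}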
However, I don't think a library extension similar to
std.parallelism would work for GPU-based microthreading.
foreach would need something to tell the compiler to generate GPU
bytecode for the code block it uses, and would need instructions on
when to use said code block based on data set size.
While functions themselves would need very little change (just a new
@microthreaded property and the built-in variables for the index
position(s)), the calling syntax would need changes to support a work
range or multidimensional range of some sort.
Perhaps looking something like:

add$(1 .. 10)(ptrA, ptrB, ptrC);

and, for a templated function, something similar:

add!(float)$(1 .. 10)(ptrA, ptrB, ptrC);
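None of that syntax exists today, of course. Purely to make the call
shape concrete, here is a CPU-only stand-in (the helper name
microthreaded is hypothetical); it just forwards the work range to
std.parallelism and involves no GPU at all:

import std.parallelism : parallel;
import std.range : iota;

// Hypothetical helper: runs kernel(i, args) for each index in [lo, hi).
// This is only a CPU emulation of the proposed add$(lo .. hi)(args) shape.
void microthreaded(alias kernel, Args...)(size_t lo, size_t hi, Args args)
{
    foreach (i; parallel(iota(lo, hi)))
        kernel(i, args);
}

void add(size_t i, float[] a, float[] b, float[] c)
{
    c[i] = a[i] + b[i];
}

void main()
{
    auto a = [1.0f, 2.0f, 3.0f];
    auto b = [4.0f, 5.0f, 6.0f];
    auto c = new float[3];
    microthreaded!add(0, 3, a, b, c); // stands in for add$(0 .. 3)(a, b, c)
}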