GPGPUs

John Colvin john.loughran.colvin at gmail.com
Fri Aug 16 13:07:31 PDT 2013


On Friday, 16 August 2013 at 19:55:56 UTC, luminousone wrote:
>> The core (!) point here is that processor chips are rapidly 
>> becoming a
>> collection of heterogeneous cores. Any programming language 
>> that assumes
>> a single CPU or a collection of homogeneous CPUs has built-in
>> obsolescence.
>>
>> So the question I am interested in is whether D is the 
>> language that can
>> allow me to express in a single codebase a program in which 
>> parts will
>> be executed on one or more GPGPUs and parts on multiple CPUs. 
>> D has
>> support for the latter, std.parallelism and std.concurrency.
>>
>> I guess my question is whether people are interested in 
>> std.gpgpu (or
>> some more sane name).
>
> CUDA works as a preprocessor pass that generates C files from 
> .cu extension files.
>
> In effect, to create a sensible environment for microthreaded 
> programming, they extend the language.
>
> A basic CUDA function looks something like this:
>
> __global__ void add( float * a, float * b, float * c) {
>    int i = threadIdx.x;
>    c[i] = a[i] + b[i];
> }
>
> add<<< 1, 10 >>>( ptrA, ptrB, ptrC );
>
> There are built-in variables to handle the index location 
> (threadIdx.x in the above example); these are supplied by the 
> thread scheduler on the video card/APU device.
>
> Generally, calls of this kind have very high latency, so using 
> this for a small handful of items as in the above example makes 
> no sense. The above example would end up using a single 
> execution cluster and leave you prey to the latency of the 
> PCIe bus, execution time, and the latency costs of video memory.
>
> It only becomes effective when you are working with large data 
> sets that can take advantage of a massive number of threads, 
> where the latency problems become secondary to the sheer 
> amount of calculation done.
>
> As far as D goes, we really only have one built-in 
> microthreading-capable language construct: foreach.
>
> However, I don't think a library extension similar to 
> std.parallelism would work for GPU-based microthreading.
>
> foreach would need some way to tell the compiler to 
> generate GPU bytecode for the code block it uses, and would 
> need instructions on when to dispatch to that code block based 
> on data-set size.
>
> While it is completely possible to change functions very 
> little (just add a new property, @microthreaded, and the 
> built-in variables for the index position(s)), the calling 
> syntax would need changes to support a work range or a 
> multidimensional range of some sort.
>
> Perhaps looking something like:
>
> add$(1 .. 10)(ptrA,ptrB,ptrC);
>
> and a templated function would look similar:
>
> add!(float)$(1 .. 10)(ptrA,ptrB,ptrC);

We have array operations, a[] = b[] * c[] - 5; etc., which could 
perhaps work very neatly?


More information about the Digitalmars-d mailing list