D and GPGPU

ponce via Digitalmars-d digitalmars-d at puremagic.com
Wed Feb 18 10:10:52 PST 2015


On Wednesday, 18 February 2015 at 15:15:21 UTC, Russel Winder 
wrote:
>
> The issue is to create a GPGPU kernel (usually C code with bizarre
> data structures and calling conventions), set it running, and then
> pipe data in and collect data out – currently very slow, but the
> next generation of Intel chips will fix this (*). And then there is
> the OpenCL/CUDA debate.
>
> Personally I prefer OpenCL, for all its deficiencies, as it is
> vendor neutral. CUDA binds you to NVIDIA. Anyway, there is an
> NVIDIA back end for OpenCL. With a system like PyOpenCL, the
> infrastructure, data, and process handling is abstracted, but you
> still have to write the kernels in C. They really ought to do a
> Python DSL for that, but… So with D, can we write D kernels and
> have them compiled and loaded using a combination of CTFE, D → C
> translation, a C compiler call, and other magic?

I'd like to comment on the kernel languages (having done both 
OpenCL and CUDA).

A big speed-up factor is the multiple levels of parallelism 
exposed in OpenCL C and CUDA C:

- context parallelism (e.g. several GPUs)
- command parallelism (based on a future model)
- block parallelism
- warp/sub-block parallelism
- in each sub-block, N threads (typically 32 or 64)
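For concreteness, here is how those levels surface in CUDA C. This is a hedged sketch (the kernel, its launch geometry, and the stream setup are invented for illustration, not code from this thread):

```cuda
#include <cuda_runtime.h>

// A kernel launch is a grid of blocks; each block is a group of threads
// that can share block-local memory and synchronize with barriers.
__global__ void levelsDemo(float *data)
{
    int block  = blockIdx.x;                   // block parallelism
    int thread = threadIdx.x;                  // per-thread parallelism
    int lane   = threadIdx.x % warpSize;       // warp/sub-block lane (32 on NVIDIA)
    int gid    = block * blockDim.x + thread;  // global index

    data[gid] *= 2.0f;

    __syncthreads();  // barrier for all threads in this block
    (void)lane;
}

int main()
{
    // Command parallelism: independent streams queue work, future-style.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    float *d0, *d1;
    cudaMalloc(&d0, 256 * sizeof(float));
    cudaMalloc(&d1, 256 * sizeof(float));

    // Two launches on two streams may run concurrently; context
    // parallelism (several GPUs) would additionally use cudaSetDevice().
    levelsDemo<<<4, 64, 0, s0>>>(d0);  // 4 blocks x 64 threads each
    levelsDemo<<<4, 64, 0, s1>>>(d1);

    cudaDeviceSynchronize();
    cudaFree(d0); cudaFree(d1);
    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    return 0;
}
```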

All of that is supported by appropriate barrier semantics. Typical 
C-like code only has threads as a source of parallelism, and a less 
complex cache hierarchy.

Also, most algorithms don't translate all that well to SIMD 
threads working in lockstep.

Example: instead of looping over a 2D image and performing a 
horizontal blur over 15 pixels, perform this operation on 32x16 
blocks simultaneously, while caching data in block-local memory.
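That blur might be sketched like this in CUDA C (tile sizes, radius, and names are invented for illustration). Each 32x16 block stages its pixels plus the blur apron into block-local (shared) memory, then every thread filters in lockstep:

```cuda
// Hypothetical horizontal box blur, one 32x16 tile per thread block.
#define RADIUS 7    // 15-pixel blur: 7 left + centre + 7 right
#define TILE_W 32
#define TILE_H 16

__global__ void hblur(const float *in, float *out, int width, int height)
{
    // Block-local memory: the tile plus a RADIUS-wide apron on each side.
    __shared__ float tile[TILE_H][TILE_W + 2 * RADIUS];

    int x  = blockIdx.x * TILE_W + threadIdx.x;
    int y  = blockIdx.y * TILE_H + threadIdx.y;
    int cy = min(y, height - 1);  // clamp so every thread can help stage

    // Stage the centre pixel, clamping reads at the image edges.
    tile[threadIdx.y][threadIdx.x + RADIUS] =
        in[cy * width + min(max(x, 0), width - 1)];

    // Threads at the left edge of the tile also stage both aprons.
    if (threadIdx.x < RADIUS) {
        int lx = max(x - RADIUS, 0);
        int rx = min(x + TILE_W, width - 1);
        tile[threadIdx.y][threadIdx.x]                   = in[cy * width + lx];
        tile[threadIdx.y][threadIdx.x + TILE_W + RADIUS] = in[cy * width + rx];
    }
    __syncthreads();  // barrier: the whole tile must be staged before filtering

    if (x < width && y < height) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[threadIdx.y][threadIdx.x + RADIUS + k];
        out[y * width + x] = sum / (2 * RADIUS + 1);
    }
}
```

Note the barrier sits outside any divergent early return, so every thread in the block reaches it; that is exactly the kind of constraint ordinary threaded C code never imposes.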

It is much like an auto-vectorization problem, and 
auto-vectorization is hard.
