D and GPGPU
ponce via Digitalmars-d
digitalmars-d at puremagic.com
Wed Feb 18 10:10:52 PST 2015
On Wednesday, 18 February 2015 at 15:15:21 UTC, Russel Winder wrote:
>
> The issue is to create a GPGPU kernel (usually C code with bizarre
> data structures and calling conventions), set it running, and then
> pipe data in and collect data out – currently very slow, but the
> next generation of Intel chips will fix this (*). And then there
> is the OpenCL/CUDA debate.
>
> Personally I prefer OpenCL, for all its deficiencies, as it is
> vendor neutral. CUDA binds you to NVIDIA. Anyway, there is an
> NVIDIA back end for OpenCL. With a system like PyOpenCL, the
> infrastructure, data, and process handling is abstracted, but you
> still have to write the kernels in C. They really ought to do a
> Python DSL for that, but… So with D, can we write D kernels and
> have them compiled and loaded using a combination of CTFE, D → C
> translation, a C compiler call, and other magic?
I'd like to talk about the kernel languages (having done both
OpenCL and CUDA).
A big speed-up factor is the multiple levels of parallelism
exposed in OpenCL C and CUDA C:
- context parallelism (e.g. several GPUs)
- command parallelism (based on a future model)
- block parallelism
- warp/sub-block parallelism
- in each sub-block, N threads (typically 32 or 64)
All of that is supported by appropriate barrier semantics. Typical
C-like code only has threads as its unit of parallelism, and a less
complex cache hierarchy.
Also, most algorithms don't translate all that well to SIMD threads
working in lockstep.
Example: instead of looping over a 2D image and performing a
15-pixel horizontal blur one pixel at a time, perform the operation
on 32x16 blocks simultaneously, while caching the pixels in
block-local memory.
It is much like an auto-vectorization problem, and
auto-vectorization is hard.