GPGPUs
luminousone
rd.hunt at gmail.com
Sat Aug 17 17:37:08 PDT 2013
We basically have to follow these rules (a minimal sketch illustrating
a few of them follows the list):
1. The range must be known prior to execution of a GPU code block.
2. The range cannot be changed during execution of a GPU code block.
3. Code blocks can only receive a single range; it can, however, be
multidimensional.
4. Index keys used in a code block are immutable.
5. Code blocks can only use a single key (the GPU executes many
instances in parallel, each with its own unique key).
6. Indices are always an unsigned integer type.
7. OpenCL and CUDA kernels have no access to global state.
8. GPU code blocks cannot allocate memory.
9. GPU code blocks cannot call CPU functions.
10. Atomics, though available on the GPU, are many times slower than
on the CPU.
11. Separate running instances of the same code block on the GPU
cannot have any interdependency on each other.
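To make a few of these concrete, here is a minimal CUDA sketch (my own
illustration; the kernel name and signature are assumptions): each
instance derives its own unique unsigned key and writes only its own
slot, so rules 5, 6, and 11 hold by construction.

__global__ void add(const float *a, const float *b, float *c, unsigned n) {
    // each instance computes its own unique, unsigned key (rules 5 and 6)
    unsigned key = blockIdx.x * blockDim.x + threadIdx.x;
    if (key < n)
        c[key] = a[key] + b[key]; // touches only its own slot: no
                                  // interdependency between instances (rule 11)
}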
Now if we are talking about HSA, or another similar setup, then a few
of those rules don't apply or become fuzzy.
HSA does have limited access to global state, HSA can call CPU
functions that are pure, and of course, because in HSA the CPU and
GPU share the same virtual address space, most of memory is open
for access.
HSA also manages memory via the hMMU, and there is no need for GPU
memory-management functions, as that is managed by the operating
system and video card drivers.
Basically, D would either need to opt out of legacy APIs such as
OpenCL, CUDA, etc. (these are mostly tied to C/C++ anyway, and
generally have ugly-as-sin syntax), or D would have to go the route
of a full and safe GPU subset of features.
I don't think such a setup can be implemented simply as a library,
as the GPU needs compiled source.
If D were to implement GPGPU features, I would actually suggest
starting by simply adding a microthreading function syntax, for
example...
void example( aggregate in float a[] ; key, in float b[], out float c[] ) {
    c[key] = a[key] + b[key];
}
By adding an aggregate keyword to the function, we can deduce the
range simply from the length of a[], without adding an extra set
of brackets or something similar.
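For contrast, a hedged CUDA host-side sketch (my illustration, reusing
the add kernel sketched above; d_a, d_b, and d_c are hypothetical
device buffers) of the bookkeeping the aggregate syntax would infer
from a.length: today the caller has to compute and pass the range by
hand.

unsigned n = 1024;                       // the range, i.e. a.length
dim3 block(256);
dim3 grid((n + block.x - 1) / block.x);  // the caller sizes the range by hand
add<<<grid, block>>>(d_a, d_b, d_c, n);  // range passed explicitly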
This would make access to the GPU more generic, and more importantly,
because LLVM will support HSA, it removes the need for writing the
more complex support into DMD that OpenCL and CUDA would require; a
few hints for the LLVM backend would be enough to generate the
dual-bytecode ELF executables.