GPGPUs

luminousone rd.hunt at gmail.com
Sat Aug 17 17:37:08 PDT 2013


We basically have to follow these rules:

1. The range must be known prior to execution of a GPU code block
2. The range cannot be changed during execution of a GPU code block
3. Code blocks can only receive a single range; it can, however, be multidimensional
4. Index keys used in a code block are immutable
5. Code blocks can only use a single key (the GPU executes many instances in parallel, each with its own unique key)
6. Indices are always an unsigned integer type
7. OpenCL and CUDA code blocks have no access to global state
8. GPU code blocks cannot allocate memory
9. GPU code blocks cannot call CPU functions
10. Atomics, though available on the GPU, are many times slower than on the CPU
11. Separate running instances of the same code block on the GPU cannot have any interdependency on each other

Now, if we are talking about HSA or another similar setup, then a 
few of those rules don't apply or become fuzzy.

HSA does have limited access to global state, HSA can call CPU 
functions that are pure, and of course, because in HSA the CPU and 
GPU share the same virtual address space, most of memory is open 
for access.

HSA also manages memory via the hMMU, so there is no need for 
GPU memory management functions; that is handled by the 
operating system and video card drivers.

Basically, D would either need to opt out of legacy APIs such as 
OpenCL, CUDA, etc. (these are mostly tied to C/C++ anyway, and 
generally have ugly-as-sin syntax), or D would have to go the 
route of a full and safe GPU subset of features.

I don't think such a setup can be implemented simply as a 
library, as the GPU needs compiled source.

If D were to implement GPGPU features, I would actually suggest 
starting by simply adding a microthreading function syntax, for 
example...

void example(aggregate in float a[]; key, in float b[], out float c[]) {
	c[key] = a[key] + b[key];
}

By adding an aggregate keyword to the function, we can infer the 
range simply from the length of a[], without adding an extra set 
of brackets or something similar.

This would make access to the GPU more generic and, more 
importantly, because LLVM will support HSA, it removes the need 
for writing more complex support into dmd, as OpenCL and CUDA 
would require; a few hints for the LLVM backend would be enough 
to generate the dual-bytecode ELF executables.


More information about the Digitalmars-d mailing list