GPGPUs

Atash nope at nope.nope
Sat Aug 17 18:43:28 PDT 2013


Unified virtual address space I can accept, fine. What I'm far,
*far* more iffy about is ignoring that it is, in fact, a totally
different address space where memory latency is *entirely
different*.

> We basically have to follow these rules:
>
> 1. The range must be none prior to execution of a gpu code block
> 2. The range cannot be changed during execution of a gpu code
> block
> 3. Code blocks can only receive a single range; it can, however,
> be multidimensional
> 4. Index keys used in a code block are immutable
> 5. Code blocks can only use a single key (the gpu executes many
> instances in parallel, each with its own unique key)
> 6. Indexes are always an unsigned integer type
> 7. OpenCL/CUDA have no access to global state
> 8. Gpu code blocks cannot allocate memory
> 9. Gpu code blocks cannot call cpu functions
> 10. Atomics, though available on the gpu, are many times slower
> than on the cpu
> 11. Separate running instances of the same code block on the
> gpu cannot have any interdependency on each other.

Please explain point 1 (specifically the use of the word 'none'),
and why you added point 3.

Additionally, point 11 doesn't make any sense to me. There is 
research out there showing how to use cooperative warp-scans, for 
example, to have multiple work-items cooperate over some local 
block of memory and perform sorting in blocks. There are even 
tutorials out there for OpenCL and CUDA that show how to do
this, specifically to create better-performing code. This
statement is in direct contradiction with what exists.
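
To make that concrete, here is a minimal sketch of exactly the
kind of cooperation point 11 would forbid (plain OpenCL C; the
kernel name and layout are mine, not from any particular
tutorial): a Hillis-Steele inclusive scan in which every
work-item in a group depends on its neighbors through local
memory.

    // OpenCL C device code. Assumes the global size matches the
    // input length; `tmp` is group-local scratch supplied by the
    // host via clSetKernelArg with a size and a NULL pointer.
    __kernel void scan_block(__global const uint *in,
                             __global uint *out,
                             __local uint *tmp)
    {
        size_t lid = get_local_id(0);
        size_t gid = get_global_id(0);
        size_t n   = get_local_size(0);

        tmp[lid] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (size_t offset = 1; offset < n; offset <<= 1) {
            uint val = tmp[lid];
            if (lid >= offset)
                val += tmp[lid - offset];   // read a neighbor's value
            barrier(CLK_LOCAL_MEM_FENCE);   // all reads complete
            tmp[lid] = val;
            barrier(CLK_LOCAL_MEM_FENCE);   // all writes complete
        }
        out[gid] = tmp[lid];
    }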

> Now if we are talking about HSA, or another similar setup, then
> a few of those rules don't apply or become fuzzy.
>
> HSA does have limited access to global state, HSA can call cpu
> functions that are pure, and of course, because in HSA the cpu
> and gpu share the same virtual address space, most memory is
> open for access.
>
> HSA also manages memory via the hMMU, and there is no need for
> gpu memory management functions, as that is managed by the
> operating system and video card drivers.

Good for HSA. Now why are we latching onto this particular 
construction that, as far as I can tell, is missing the support 
of at least two highly relevant giants (Intel and NVidia)?

> Basically, D would either need to opt out of legacy APIs such
> as OpenCL, CUDA, etc. (these are mostly tied to C/C++ anyway,
> and generally have ugly-as-sin syntax), or D would have to go
> the route of a full and safe gpu subset of features.

Wrappers do a lot to change the appearance of a program. Raw 
OpenCL may look ugly, but so do BLAS and LAPACK routines. The use 
of wrappers and expression templates does a lot to clean up code 
(e.g. look at the way Eigen 3 or any other linear algebra
library uses expression templates in C++; something D can do
even better).
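
For illustration only, here is a toy of the expression-template
idea in C++ (all names are mine, not Eigen's): `a + b` builds a
lightweight expression object, and the single loop runs at
assignment time.

    // Toy expression-template sketch; names are hypothetical,
    // not Eigen's API. `a + b` allocates nothing; the loop is
    // fused into the assignment.
    #include <cstddef>
    #include <vector>

    struct Vec;

    struct AddExpr {
        const Vec &l, &r;
        double operator[](std::size_t i) const;
    };

    struct Vec {
        std::vector<double> data;
        explicit Vec(std::size_t n) : data(n) {}
        double  operator[](std::size_t i) const { return data[i]; }
        double &operator[](std::size_t i)       { return data[i]; }
        Vec &operator=(const AddExpr &e) {
            for (std::size_t i = 0; i < data.size(); ++i)
                data[i] = e[i];             // one fused loop
            return *this;
        }
    };

    double AddExpr::operator[](std::size_t i) const {
        return l[i] + r[i];
    }

    AddExpr operator+(const Vec &a, const Vec &b) { return {a, b}; }

    int main() {
        Vec a(3), b(3), c(3);
        a[0] = 1; b[0] = 2;
        c = a + b;                          // no temporary Vec
        return c[0] == 3 ? 0 : 1;
    }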

> I don't think such a setup can be implemented simply as a
> library, as the GPU needs compiled source.

This doesn't make sense. Your claim is contingent on opting out
of OpenCL or any other mechanism that lets the application carry
abstract instructions which are then compiled on the fly. If
you're okay with creating kernel code on the fly, this can be
implemented as a library, beyond any reasonable doubt.
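
As a sketch of what that looks like with the stock OpenCL host
API (context and device setup assumed to exist; error handling
omitted), a library can assemble kernel source as a string at
runtime and let the driver compile it:

    // Host-side sketch: runtime compilation through the
    // standard OpenCL API. Error handling omitted for brevity.
    #include <CL/cl.h>
    #include <string>

    cl_kernel make_vector_add(cl_context ctx, cl_device_id dev)
    {
        std::string src =
            "__kernel void vadd(__global const float *a,\n"
            "                   __global const float *b,\n"
            "                   __global float *c) {\n"
            "    size_t i = get_global_id(0);\n"
            "    c[i] = a[i] + b[i];\n"
            "}\n";
        const char *text = src.c_str();
        size_t len = src.size();

        cl_program prog =
            clCreateProgramWithSource(ctx, 1, &text, &len, NULL);
        clBuildProgram(prog, 1, &dev, "", NULL, NULL);
        return clCreateKernel(prog, "vadd", NULL);
    }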

> If D were to implement gpgpu features, I would actually
> suggest starting by simply adding a microthreading function 
> syntax, for example...
>
> void example(aggregate in float a[]; key, in float b[],
>              out float c[]) {
>     c[key] = a[key] + b[key];
> }
>
> By adding an aggregate keyword to the function, we can assume
> the range simply by using the length of a[], without adding an
> extra set of brackets or something similar.
>
> This would make access to the gpu more generic and, more
> importantly, because llvm will support HSA, it removes the need
> to write the more complex support into dmd that OpenCL and CUDA
> would require; a few hints for the llvm backend would be enough
> to generate the dual-bytecode ELF executables.
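
Before the numbered questions: here is how I read that
declaration in OpenCL terms. This mapping is mine, not anything
the proposal specifies, but it shows roughly what the compiler
would have to emit.

    // Hypothetical lowering of the quoted example() to OpenCL C;
    // one instance runs per element of a[], with 'key' supplied
    // by the runtime as the global work-item id.
    __kernel void example(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        size_t key = get_global_id(0);
        c[key] = a[key] + b[key];
    }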

1) If you wanted to have that 'key' nonsense in there, I'm
thinking you'd need to add several additional parameters: global
size, group size, group count, and maybe group-local memory
access (requires allowing multiple aggregates?). I mean, I get
the gist of what you're saying; this isn't me pointing out a
problem, just trying to get a clarification (maybe give 'key'
some additional structure, or something; see the sketch at the
end of this message).

2) ... I kind of like this idea. I disagree with how you led up 
to it, but I like the idea.

3) How do you envision *calling* microthreaded code? Just the 
usual syntax?

4) How would this handle working on subranges?

For example, let's say I'm coding up a radix sort using
something like this:

https://sites.google.com/site/duanemerrill/PplGpuSortingPreprint.pdf?attredirects=0

What's the high-level program organization with this syntax if we 
can only use one range at a time? How many work-items get fired 
off? What's the gpu-code launch procedure?
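
On the 'key' structure from question 1, here is a hypothetical
shape for it, mirroring the per-work-item queries OpenCL C
already exposes (all names here are mine):

    // Hypothetical richer 'key'; each field mirrors an existing
    // OpenCL C query, noted in the comments.
    struct Key {
        size_t global_id;    // get_global_id(dim)
        size_t global_size;  // get_global_size(dim)
        size_t local_id;     // get_local_id(dim), position in group
        size_t local_size;   // get_local_size(dim), group size
        size_t group_id;     // get_group_id(dim)
        size_t num_groups;   // get_num_groups(dim)
    };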

