GPGPUs

luminousone rd.hunt at gmail.com
Sat Aug 17 19:46:08 PDT 2013


On Sunday, 18 August 2013 at 01:43:33 UTC, Atash wrote:
> Unified virtual address space I can accept, fine. What I'm far, 
> *far* more iffy about is ignoring that it is, in fact, a totally 
> different address space where memory latency is *entirely 
> different*.
>
>> We basically have to follow these rules,
>>
>> 1. The range must be none prior to execution of a gpu code 
>> block
>> 2. The range can not be changed during execution of a gpu code 
>> block
>> 3. Code blocks can only receive a single range; it can, 
>> however, be multidimensional
>> 4. index keys used in a code block are immutable
>> 5. Code blocks can only use a single key (the gpu executes many 
>> instances in parallel, each with their own unique key)
>> 6. indexes are always an unsigned integer type
>> 7. openCL and CUDA have no access to global state
>> 8. gpu code blocks can not allocate memory
>> 9. gpu code blocks can not call cpu functions
>> 10. atomics, though available on the gpu, are many times slower 
>> than on the cpu
>> 11. separate running instances of the same code block on the 
>> gpu can not have any interdependency on each other.
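
To make those rules concrete, here is roughly what such a code 
block looks like in plain OpenCL C today (a from-memory sketch, 
not pulled from any particular project; the kernel name and 
argument names are made up):

__kernel void vec_add(__global const float* a,
                      __global const float* b,
                      __global float* c)
{
    /* rules 5/6: each running instance gets exactly one
       unsigned key */
    size_t key = get_global_id(0);

    /* rule 4: the key is only read, never written
       rules 7/8/9: no global state, no allocation, no calls
       back into cpu code */
    c[key] = a[key] + b[key];
}

The host fixes the range (the global work size) before launch and 
cannot change it while the kernel is running, which is rules 1 
and 2.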
>
> Please explain point 1 (specifically the use of the word 
> 'none'), and why you added in point 3?
>
> Additionally, point 11 doesn't make any sense to me. There is 
> research out there showing how to use cooperative warp-scans, 
> for example, to have multiple work-items cooperate over some 
> local block of memory and perform sorting in blocks. There are 
> even tutorials out there for OpenCL and CUDA that show how to 
> do this, specifically to create better-performing code. This 
> statement is in direct contradiction with what exists.
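
For reference, the kind of intra-work-group cooperation being 
described looks roughly like this in OpenCL C (a from-memory 
sketch of a simple work-group reduction rather than a full scan 
or sort; it assumes the work-group size is a power of two and 
that the global size is a multiple of it):

__kernel void block_sum(__global const float* in,
                        __global float* out,
                        __local float* scratch)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    /* each work-item loads one element into group-local memory */
    scratch[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* tree reduction inside the work-group: the instances
       plainly depend on each other's intermediate results */
    for (size_t step = get_local_size(0) / 2; step > 0; step /= 2) {
        if (lid < step)
            scratch[lid] += scratch[lid + step];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}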
>
>> Now if we are talking about HSA, or another similar setup, then 
>> a few of those rules don't apply or become fuzzy.
>>
>> HSA does have limited access to global state, HSA can call 
>> cpu functions that are pure, and of course, because in HSA the 
>> cpu and gpu share the same virtual address space, most memory 
>> is open for access.
>>
>> HSA also manages memory via the hMMU, and there is no need 
>> for gpu memory management functions, as that is managed by the 
>> operating system and video card drivers.
>
> Good for HSA. Now why are we latching onto this particular 
> construction that, as far as I can tell, is missing the support 
> of at least two highly relevant giants (Intel and NVidia)?
>
>> Basically, D would either need to opt out of legacy APIs such 
>> as openCL, CUDA, etc. (these are mostly tied to c/c++ anyway, 
>> and generally have ugly-as-sin syntax), or D would have to go 
>> the route of a full and safe gpu subset of features.
>
> Wrappers do a lot to change the appearance of a program. Raw 
> OpenCL may look ugly, but so do BLAS and LAPACK routines. The 
> use of wrappers and expression templates does a lot to clean up 
> code (ex. look at the way Eigen 3 or any other linear algebra 
> library does expression templates in C++; something D can do 
> even better).
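
As a small illustration of that point with Eigen 3 (the matrix 
sizes and variable names here are just for the example):

#include <Eigen/Dense>

int main()
{
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(512, 512);
    Eigen::VectorXd x = Eigen::VectorXd::Random(512);
    Eigen::VectorXd b = Eigen::VectorXd::Random(512);

    // reads like the math; the expression templates decide how
    // to evaluate it instead of the user chaining raw gemv/axpy
    // calls and temporaries
    Eigen::VectorXd y = 2.0 * (A * x) + b;
    return 0;
}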
>
>> I don't think such a setup can be implemented simply as a 
>> library, as the GPU needs compiled source.
>
> This doesn't make sense. Your claim is contingent on opting out 
> of OpenCL or any other mechanism that provides for the 
> application to carry abstract instructions which are then 
> compiled on the fly. If you're okay with creating kernel code 
> on the fly, this can be implemented as a library, beyond any 
> reasonable doubt.
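
That on-the-fly path already exists in the plain OpenCL C API; a 
bare-bones host-side sketch (error handling omitted, and it 
assumes a context and device have already been created):

#include <CL/cl.h>

/* a compiler or library could just as well have generated this
   string at run time from a higher-level description */
static const char* source =
    "__kernel void scale(__global const float* a,\n"
    "                    __global float* c,\n"
    "                    const float k)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = k * a[i];\n"
    "}\n";

cl_kernel build_scale_kernel(cl_context ctx, cl_device_id dev)
{
    cl_int err = 0;
    cl_program prog =
        clCreateProgramWithSource(ctx, 1, &source, NULL, &err);
    /* the kernel source is compiled here, at run time */
    clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    return clCreateKernel(prog, "scale", &err);
}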
>
>> If D were to implement gpgpu features, I would actually 
>> suggest starting by simply adding a microthreading function 
>> syntax, for example...
>>
>> void example(aggregate in float a[]; key, in float b[], out 
>> float c[]) {
>>     c[key] = a[key] + b[key];
>> }
>>
>> By adding an aggregate keyword to the function, we can infer 
>> the range simply from the length of a[], without adding an 
>> extra set of brackets or something similar.
>>
>> This would make access to the gpu more generic and, more 
>> importantly, because llvm will support HSA, it removes the 
>> need to write the more complex support into dmd that openCL 
>> and CUDA would require; a few hints for the llvm backend 
>> would be enough to generate the dual-bytecode ELF executables.
>
> 1) If you wanted to have that 'key' nonsense in there, I'm 
> thinking you'd need to add several additional parameters: 
> global size, group size, group count, and maybe group-local 
> memory access (requires allowing multiple aggregates?). I mean, 
> I get the gist of what you're saying, this isn't me pointing 
> out a problem, just trying to get a clarification on it (maybe 
> give 'key' some additional structure, or something).
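
For what it's worth, in OpenCL C those extra pieces of 
information come from built-in query functions rather than 
extra parameters, so a richer 'key' could plausibly just bundle 
them; roughly (kernel name made up):

__kernel void who_am_i(__global uint* out)
{
    size_t gid    = get_global_id(0);    /* the unique key      */
    size_t gsize  = get_global_size(0);  /* global size         */
    size_t lid    = get_local_id(0);     /* position in group   */
    size_t lsize  = get_local_size(0);   /* group size          */
    size_t group  = get_group_id(0);     /* which group this is */
    size_t groups = get_num_groups(0);   /* group count         */

    /* writes 1 when the launch is well formed: the global key
       decomposes into group and local coordinates, and the
       global size is the product of group count and group size */
    out[gid] = (uint)((group * lsize + lid == gid) &&
                      (groups * lsize == gsize));
}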
>
> 2) ... I kind of like this idea. I disagree with how you led up 
> to it, but I like the idea.
>
> 3) How do you envision *calling* microthreaded code? Just the 
> usual syntax?
>
> 4) How would this handle working on subranges?
>
> ex. Let's say I'm coding up a radix sort using something like 
> this:
>
> https://sites.google.com/site/duanemerrill/PplGpuSortingPreprint.pdf?attredirects=0
>
> What's the high-level program organization with this syntax if 
> we can only use one range at a time? How many work-items get 
> fired off? What's the gpu-code launch procedure?

Sorry, typo; in rule 1 I meant "known", not "none".

