GPGPUs
luminousone
rd.hunt at gmail.com
Sat Aug 17 19:46:08 PDT 2013
On Sunday, 18 August 2013 at 01:43:33 UTC, Atash wrote:
> Unified virtual address space I can accept, fine. What I'm far,
> *far* more iffy about is ignoring that it is, in fact, a
> totally different address space where memory latency is
> *entirely different*.
>
>> We basically have to follow these rules,
>>
>> 1. The range must be none prior to execution of a gpu code
>> block
>> 2. The range cannot be changed during execution of a gpu code
>> block
>> 3. Code blocks can only receive a single range; it can,
>> however, be multidimensional
>> 4. Index keys used in a code block are immutable
>> 5. Code blocks can only use a single key (the gpu executes
>> many instances in parallel, each with its own unique key)
>> 6. Indices are always an unsigned integer type
>> 7. openCL and CUDA kernels have no access to global state
>> 8. gpu code blocks cannot allocate memory
>> 9. gpu code blocks cannot call cpu functions
>> 10. Atomics, though available on the gpu, are many times
>> slower than on the cpu
>> 11. Separate running instances of the same code block on the
>> gpu cannot have any interdependency on each other.
>
> Please explain point 1 (specifically the use of the word
> 'none'), and why you added in point 3?
>
> Additionally, point 11 doesn't make any sense to me. There is
> research out there showing how to use cooperative warp-scans,
> for example, to have multiple work-items cooperate over some
> local block of memory and perform sorting in blocks. There are
> even tutorials out there for OpenCL and CUDA that show how to
> do this, specifically to create better performing code. This
> statement is in direct contradiction with what exists.
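>
> (To make the dependency pattern concrete, here is a sketch in
> plain D -- not gpu code -- of a Hillis-Steele inclusive scan
> over one work-group's block. On the gpu, each round of the
> outer loop is separated by a barrier, and every work-item reads
> a neighbour's result from the previous round:
>
> void blockScan(float[] block)
> {
>     for (size_t offset = 1; offset < block.length; offset *= 2)
>     {
>         auto prev = block.dup; // snapshot = the barrier between rounds
>         foreach (i; offset .. block.length)
>             block[i] = prev[i] + prev[i - offset];
>     }
> }
>
> That cross-item read is exactly the kind of interdependency
> your point 11 rules out.)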
>
>> Now if we are talking about HSA, or other similar setup, then
>> a few of those rules don't apply or become fuzzy.
>>
>> HSA does have limited access to global state, HSA can call
>> cpu functions that are pure, and of course, because in HSA
>> the cpu and gpu share the same virtual address space, most of
>> memory is open for access.
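>>
>> For instance (a hypothetical example), a function like
>>
>> pure float clampf(float x, float lo, float hi)
>> {
>>     return x < lo ? lo : (x > hi ? hi : x);
>> }
>>
>> touches no global state, so with a shared virtual address
>> space the gpu side could call it directly.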
>>
>> HSA also manages memory, via the hMMU, and there is no need
>> for gpu memory management functions, as that is managed by the
>> operating system and video card drivers.
>
> Good for HSA. Now why are we latching onto this particular
> construction that, as far as I can tell, is missing the support
> of at least two highly relevant giants (Intel and NVidia)?
>
>> Basically, D would either need to opt out of legacy APIs such
>> as openCL, CUDA, etc. (these are mostly tied to c/c++ anyway,
>> and generally have ugly-as-sin syntax), or D would have to go
>> the route of a full and safe gpu subset of features.
>
> Wrappers do a lot to change the appearance of a program. Raw
> OpenCL may look ugly, but so do BLAS and LAPACK routines. The
> use of wrappers and expression templates does a lot to clean up
> code (ex. look at the way Eigen 3 or any other linear algebra
> library does expression templates in C++; something D can do
> even better).
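>
> As a rough sketch of the shape of that technique in D (all
> names here are made up; this is not any existing library):
>
> // 'a + b' builds a lazy Sum node instead of a temporary array;
> // the element-wise loop runs once, on assignment.
> struct Sum(L, R)
> {
>     L lhs;
>     R rhs;
>     float opIndex(size_t i) const { return lhs[i] + rhs[i]; }
>     @property size_t length() const { return lhs.length; }
>     auto opBinary(string op : "+", T)(T r) { return Sum!(Sum, T)(this, r); }
> }
>
> struct Vec
> {
>     float[] data;
>     float opIndex(size_t i) const { return data[i]; }
>     @property size_t length() const { return data.length; }
>     auto opBinary(string op : "+", T)(T r) { return Sum!(Vec, T)(this, r); }
>     void opAssign(E)(E expr)
>     {
>         foreach (i; 0 .. expr.length)
>             data[i] = expr[i]; // single fused loop, no temporaries
>     }
> }
>
> auto a = Vec([1f, 2, 3, 4]);
> auto b = Vec([4f, 3, 2, 1]);
> auto c = Vec(new float[](4));
> c = a + b; // evaluates element-wise in one pass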
>
>> I don't think such a setup can be implemented as simply a
>> library, as the GPU needs compiled source.
>
> This doesn't make sense. Your claim is contingent on opting out
> of OpenCL or any other mechanism that provides for the
> application to carry abstract instructions which are then
> compiled on the fly. If you're okay with creating kernel code
> on the fly, this can be implemented as a library, beyond any
> reasonable doubt.
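>
> (A sketch of what I mean, in D. clCreateProgramWithSource and
> clBuildProgram are the standard OpenCL entry points; the exact
> D-side signatures depend on whichever binding you use, and
> error handling is elided:
>
> import std.string : toStringz;
>
> // build kernel source as an ordinary string at runtime
> string kernelFor(string T, string expr)
> {
>     return "__kernel void k(__global " ~ T ~ "* a, __global " ~ T
>          ~ "* b, __global " ~ T ~ "* c) {\n"
>          ~ "    size_t i = get_global_id(0);\n"
>          ~ "    c[i] = " ~ expr ~ ";\n}\n";
> }
>
> auto src  = kernelFor("float", "a[i] + b[i]").toStringz;
> auto prog = clCreateProgramWithSource(ctx, 1, &src, null, &err);
> clBuildProgram(prog, 0, null, null, null, null);
>
> The application carries abstract kernel text and the driver
> compiles it on the fly -- no special compiler support needed.)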
>
>> If D were to implement gpgpu features, I would actually
>> suggest starting by simply adding a microthreading function
>> syntax, for example...
>>
>> void example( aggregate in float a[] ; key,
>>               in float b[], out float c[] )
>> {
>>     c[key] = a[key] + b[key];
>> }
>>
>> By adding an aggregate keyword to the function, we can infer
>> the range simply from the length of a[], without adding an
>> extra set of brackets or something similar.
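>>
>> The call site would then be an ordinary function call
>> (hypothetical, of course, since none of this syntax exists
>> yet):
>>
>> float[] a = [1f, 2, 3, 4];
>> float[] b = [4f, 3, 2, 1];
>> float[] c = new float[](a.length);
>> example(a, b, c); // a.length gpu instances, key = 0 .. a.length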
>>
>> This would make access to the gpu more generic and, more
>> importantly, because llvm will support HSA, it removes the
>> need for writing more complex support into dmd as openCL and
>> CUDA would require; a few hints for the llvm backend would be
>> enough to generate the dual-bytecode ELF executables.
>
> 1) If you wanted to have that 'key' nonsense in there, I'm
> thinking you'd need to add several additional parameters:
> global size, group size, group count, and maybe group-local
> memory access (requires allowing multiple aggregates?). I mean,
> I get the gist of what you're saying, this isn't me pointing
> out a problem, just trying to get a clarification on it (maybe
> give 'key' some additional structure, or something).
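>
> e.g. something along these lines -- entirely made-up syntax,
> only to make the question concrete:
>
> void example( aggregate(groupsize = 64) in float a[] ; key,
>               local float scratch[64],
>               in float b[], out float c[] )
> {
>     // with 'key' carrying .global, .group, and .local indices
>     c[key.global] = a[key.global] + b[key.global];
> }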
>
> 2) ... I kind of like this idea. I disagree with how you led up
> to it, but I like the idea.
>
> 3) How do you envision *calling* microthreaded code? Just the
> usual syntax?
>
> 4) How would this handle working on subranges?
>
> ex. Let's say I'm coding up a radix sort using something like
> this:
>
> https://sites.google.com/site/duanemerrill/PplGpuSortingPreprint.pdf?attredirects=0
>
> What's the high-level program organization with this syntax if
> we can only use one range at a time? How many work-items get
> fired off? What's the gpu-code launch procedure?
Sorry, typo in rule 1: I meant 'known', i.e. the range must be known prior to execution of a gpu code block.