GPGPUs

Atash nope at nope.nope
Sat Aug 17 22:05:47 PDT 2013


On Sunday, 18 August 2013 at 03:55:58 UTC, luminousone wrote:
> You do have limited atomics, but you don't really have any sort 
> of complex messages or anything like that.

I said 'point 11', not 'point 10'. You also dodged points 1 and 
3...

> Intel doesn't have a dog in this race, so there is no way to 
> know what they plan on doing, if anything at all.

http://software.intel.com/en-us/vcsource/tools/opencl-sdk
http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html

Just based on those, I'm pretty certain they 'have a dog in this 
race'. The dog happens to be running with MPI and OpenCL across a 
bridge made of PCIe.

> The reason to point out HSA is that it is really easy to add 
> support for; it is not a giant task like OpenCL would be. A few 
> changes to the front-end compiler are all that is needed; 
> LLVM's backend does the rest.

H'okay. I can accept that.

> OpenCL isn't just a library; it is a language extension that 
> is run through a preprocessor that compiles the embedded 
> __KERNEL and __DEVICE functions into usable code, and then 
> outputs .c/.cpp files for the C compiler to deal with.

But all those extra bits are part of the computing *environment*. 
Is there something wrong with requiring the proper environment 
for an executable?
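
To make that concrete: in standard OpenCL the kernel ships as 
source text, and the driver's runtime compiles it when the program 
runs - the 'preprocessor' machinery lives in the environment, not 
in the shipped executable. A minimal host-side sketch in C (error 
handling omitted):

#include <CL/cl.h>

static const char *src =
    "__kernel void add(__global const float *a,\n"
    "                  __global const float *b,\n"
    "                  __global float *c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void) {
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);

    /* Compilation happens *here*, against whatever device the
     * host environment provides. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src,
                                                NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel add = clCreateKernel(prog, "add", NULL);
    /* ... create buffers, set args, enqueue over an NDRange ... */
    (void)add;
    return 0;
}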

A more objective question: which devices are you trying to target 
here?

> Those are all platform specific; they change based on the whim 
> and fancy of NVIDIA and AMD with each and every new chip 
> released: the size and configuration of CUDA clusters, or 
> compute clusters, or EUs, or whatever the hell chip maker X 
> feels like using at the moment.
>
> Long term, this will all be managed by the underlying support 
> software in the video drivers and operating system kernel. 
> Putting any effort into this is a waste of time.

Yes. And the only way to optimize around them is to *know them*; 
otherwise you're pinning the developer down the same way OpenMP 
does. Actually, even worse than the way OpenMP does - at least 
OpenMP lets you set some hints about how many threads you want.
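
For instance, a trivial sketch in C - the hint in the pragma is 
the whole point:

#include <stdio.h>

int main(void) {
    float a[100], b[100], c[100];
    for (int i = 0; i < 100; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* The programmer can at least *suggest* a thread count;
     * the runtime still owns the actual scheduling. */
    #pragma omp parallel for num_threads(8)
    for (int i = 0; i < 100; i++)
        c[i] = a[i] + b[i];

    printf("%f\n", c[99]);
    return 0;
}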

> void example( aggregate in float a[] ; key, in float b[],
>               out float c[] ) {
> 	c[key] = a[key] + b[key];
> }
>
> example(a,b,c);
>
> In the function declaration, you can think of the aggregate as 
> basically having the reverse order of the items in a foreach 
> statement.
>
> int a[100] = [ ... ];
> int b[100];
> foreach( v, k ; a ) { b[k] = a[k]; }
>
> float a[100] = [ ... ];
> float b[100];
>
> void example2( aggregate in float A[] ; k, out float B[] ) {
>    B[k] = A[k];
> }
>
> example2(a,b);

Contextually solid. Read my response to the next bit.

> I am pretty sure they are simply multiplying the index value by 
> the unit size they desire to work on:
>
> float a[100] = [ ... ];
> float b[100];
> void example3( aggregate in range r ; k, in float a[],
>                out float b[] ) {
>    b[k*2]   = a[k*2];
>    b[k*2+1] = a[k*2+1];
> }
>
> example3( 0 .. 50, a, b );
>
> Then my guess is that they are simply executing multiple 
> __KERNEL functions in sequence.

I've already implemented this algorithm in OpenCL, and what 
you're saying so far doesn't rhyme with what's needed.

There are at least two ranges, one keeping track of partial 
summations, the other holding the partial sorts. Three separate 
kernels are run in cycles to reduce over and scatter the data. 
The way early exit is implemented isn't mentioned as part of the 
implementation details, but my implementation of the strategy 
requires a third range to act as a flag array to be reduced over 
and read back between kernel invocations.
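
Roughly, the host side of that cycle looks like this - a sketch 
against OpenCL's C API, where the kernel and buffer names are 
mine, not from the paper:

#include <CL/cl.h>

/* Hypothetical kernels/buffers; setup as in any OpenCL host. */
static void run_passes(cl_command_queue q, cl_kernel reduce_k,
                       cl_kernel scan_k, cl_kernel scatter_k,
                       cl_mem flag_buf, size_t global, size_t local)
{
    int done = 0;
    while (!done) {
        /* One pass of the cycle: reduce, scan, scatter. */
        clEnqueueNDRangeKernel(q, reduce_k, 1, NULL,
                               &global, &local, 0, NULL, NULL);
        clEnqueueNDRangeKernel(q, scan_k, 1, NULL,
                               &global, &local, 0, NULL, NULL);
        clEnqueueNDRangeKernel(q, scatter_k, 1, NULL,
                               &global, &local, 0, NULL, NULL);

        /* Early exit: read the reduced flag array back between
         * invocations. This round trip is the part that doesn't
         * fall out of simple index arithmetic. */
        cl_int flag;
        clEnqueueReadBuffer(q, flag_buf, CL_TRUE, 0, sizeof flag,
                            &flag, 0, NULL, NULL);
        done = (flag == 0);
    }
}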

It isn't just unit-size multiplication - there's communication 
between work-items and *exquisitely* arranged local-group 
reductions and scans (so-called 'warpscans') that take advantage 
of the widely accepted concept of a local group of work-items (a 
parameter you explicitly disregarded) and their shared memory 
pool. The entire point of the paper is that it's possible to come 
up with a general algorithm that can be parameterized to fit 
individual GPU configurations if desired. This kind of algorithm 
provides opportunities for tuning... which seem to be lost, 
unnecessarily, in what I've read so far in your descriptions.
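
For a flavor of what those local-group scans involve, here's a 
minimal Hillis-Steele-style sketch in OpenCL C - not the paper's 
tuned warpscan, just the shape of the idea:

__kernel void group_scan(__global const int *in,
                         __global int *out,
                         __local int *tmp) {
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    size_t n   = get_local_size(0);

    /* Stage this group's slice in the shared local memory pool. */
    tmp[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Inclusive scan within the work-group: work-items
     * communicate through local memory, synchronized by
     * barriers. */
    for (size_t off = 1; off < n; off <<= 1) {
        int add = (lid >= off) ? tmp[lid - off] : 0;
        barrier(CLK_LOCAL_MEM_FENCE);
        tmp[lid] += add;
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    out[gid] = tmp[lid];
}

The local-group size is exactly the sort of per-device parameter 
you'd want to be able to tune.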

My point being, I don't like where this is going: treating 
coprocessors, which have so far been very *very* different from 
one another, as the same batch of *whatever*. I also don't like 
that it ignores NVIDIA, and ignores Intel's push for 
general-purpose accelerators such as their Xeon Phi.

But, meh, if HSA is so easy, then it's low-hanging fruit, so 
whatever, go ahead and push for it.

=== REMINDER OF RELEVANT STUFF FURTHER UP IN THE POST:

"A more objective question: which devices are you trying to 
target here?"

=== AND SOMETHING ELSE:

I feel like we're just on different wavelengths. At what level do 
you imagine having this support, in terms of doing low-level 
things? Is this something like OpenMP, where threading and such 
are done at a really (really really really...) high level, or 
what?

