GPGPUs

luminousone rd.hunt at gmail.com
Sat Aug 17 23:22:25 PDT 2013


On Sunday, 18 August 2013 at 05:05:48 UTC, Atash wrote:
> On Sunday, 18 August 2013 at 03:55:58 UTC, luminousone wrote:
>> You do have limited Atomics, but you don't really have any 
>> sort of complex messages, or anything like that.
>
> I said 'point 11', not 'point 10'. You also dodged points 1 and 
> 3...
>
>> Intel doesn't have a dog in this race, so their is no way to 
>> know what they plan on doing if anything at all.
>
> http://software.intel.com/en-us/vcsource/tools/opencl-sdk
> http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html
>
> Just based on those, I'm pretty certain they 'have a dog in 
> this race'. The dog happens to be running with MPI and OpenCL 
> across a bridge made of PCIe.

The Xeon Phi is interesting insofar as it takes generic 
programming into a more parallel environment. However, it has 
some serious limitations that will heavily damage its potential 
performance.

AVX2 is completely the wrong path for improving performance in 
parallel computing. The SIMD nature of the instruction set means 
that scalar operations, or even just failing to fill the giant 
256/512-bit registers, waste huge chunks of this thing's peak 
theoretical performance, and if any instruction-pairing rules 
apply to this multi-issue pipeline you have yet more potential 
for wasted cycles.

I haven't seen anything about Intel's micro-thread scheduler, or 
how these chips handle the mass context switching natural to 
micro-threaded environments. These two items make a huge 
difference in performance. Comparing Radeon VLIW5/4 to Radeon GCN 
is a good example: most of GCN's performance benefit comes from 
the ease of scheduling scalar pipelines over more complex pipes 
with instruction-pairing rules and the like.

Frankly, Intel has some cool stuff, but they have been caught 
with their pants down; they have depended on their large fab 
advantage to carry them and got lazy.

We likely are watching AMD64 all over again.

>> The reason to point out HSA, is because it is really easy add 
>> support for, it is not a giant task like opencl would be. A 
>> few changes to the front end compiler is all that is needed, 
>> LLVM's backend does the rest.
>
> H'okay. I can accept that.
>
>> OpenCL isn't just a library, it is a language extension, that 
>> is ran through a preprocessor that compiles the embedded 
>> __KERNEL and __DEVICE functions, into usable code, and then 
>> outputs .c/.cpp files for the c compiler to deal with.
>
> But all those extra bits are part of the computing 
> *environment*. Is there something wrong with requiring the 
> proper environment for an executable?
>
> A more objective question: which devices are you trying to 
> target here?

At first, simply a different way of approaching 
std.parallelism-like functionality, with an eye toward GPGPU in 
the future when easy integration solutions pop up (such as HSA).
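
For reference, this is roughly what that looks like with today's 
std.parallelism on the CPU; just a sketch, with the GPGPU-backed 
dispatch being the hypothetical part:

import std.parallelism;

// Runs across CPU worker threads today; the idea would be that the
// same shape of loop could later be dispatched to a GPU via HSA.
void addArrays(const float[] a, const float[] b, float[] c)
{
    foreach (k, ref elem; taskPool.parallel(c))
        elem = a[k] + b[k];
}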

>> Those are all platform specific, they change based on the whim 
>> and fancy of NVIDIA and AMD with each and every new chip 
>> released, The size and configuration of CUDA clusters, or 
>> compute clusters, or EU's, or whatever the hell x chip maker 
>> feels like using at the moment.
>>
>> Long term this will all be managed by the underlying support 
>> software in the video drivers, and operating system kernel. 
>> Putting any effort into this is a waste of time.
>
> Yes. And the only way to optimize around them is to *know 
> them*, otherwise you're pinning the developer down the same way 
> OpenMP does. Actually, even worse than the way OpenMP does - at 
> least OpenMP lets you set some hints about how many threads you 
> want.

It would be best to wait for a more generic software platform to 
find out how this is handled by the next generation of 
micro-threading tools.

The way OpenCL/CUDA work reminds me too much of someone setting 
up Tomcat to have Java code generate PHP that runs on their 
Apache server, just because they can. I would rather have tighter 
integration with the core language than a language within a 
language.
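
For contrast, this is roughly what the "language in a language" 
pattern looks like from D today: the kernel is OpenCL C carried 
around as a string and compiled by the driver at runtime (sketch 
only; all the clCreateProgramWithSource/queue setup is omitted):

// The kernel below is OpenCL C, not D; it only exists as text until
// the OpenCL runtime compiles it, which is the layering complained
// about above.
immutable kernelSrc = `
    __kernel void add(__global const float* a,
                      __global const float* b,
                      __global float* c)
    {
        size_t k = get_global_id(0);
        c[k] = a[k] + b[k];
    }
`;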

>> void example( aggregate in float a[] ; key , in float b[], out
>>   float c[]) {
>> 	c[key] = a[key] + b[key];
>> }
>>
>> example(a,b,c);
>>
>> in the function declaration you can think of the aggregate 
>> basically having the reserve order of the items in a foreach 
>> statement.
>>
>> int a[100] = [ ... ];
>> int b[100];
>> foreach( v, k ; a ) { b = a[k]; }
>>
>> int a[100] = [ ... ];
>> int b[100];
>>
>> void example2( aggregate in float A[] ; k, out float B[] ) { 
>> B[k] = A[k]; }
>>
>> example2(a,b);
>
> Contextually solid. Read my response to the next bit.
>
>> I am pretty sure they are simply multiplying the index value 
>> by the unit size they desire to work on
>>
>> int a[100] = [ ... ];
>> int b[100];
>> void example3( aggregate in range r ; k, in float a[], float 
>> b[]){
>>   b[k]   = a[k];
>>   b[k+1] = a[k+1];
>> }
>>
>> example3( 0 .. 50 , a,b);
>>
>> Then likely they are simply executing multiple __KERNEL 
>> functions in sequence, would be my guess.
>
> I've implemented this algorithm before in OpenCL already, and 
> what you're saying so far doesn't rhyme with what's needed.
>
> There are at least two ranges, one keeping track of partial 
> summations, the other holding the partial sorts. Three separate 
> kernels are ran in cycles to reduce over and scatter the data. 
> The way early exit is implemented isn't mentioned as part of 
> the implementation details, but my implementation of the 
> strategy requires a third range to act as a flag array to be 
> reduced over and read in between kernel invocations.
>
> It isn't just unit size multiplication - there's communication 
> between work-items and *exquisitely* arranged local-group 
> reductions and scans (so-called 'warpscans') that take 
> advantage of the widely accepted concept of a local group of 
> work-items (a parameter you explicitly disregarded) and their 
> shared memory pool. The entire point of the paper is that it's 
> possible to come up with a general algorithm that can be 
> parameterized to fit individual GPU configurations if desired. 
> This kind of algorithm provides opportunities for tuning... 
> which seem to be lost, unnecessarily, in what I've read so far 
> in your descriptions.
>
> My point being, I don't like where this is going by treating 
> coprocessors, which have so far been very *very* different from 
> one another, as the same batch of *whatever*. I also don't like 
> that it's ignoring NVidia, and ignoring Intel's push for 
> general-purpose accelerators such as their Xeon Phi.
>
> But, meh, if HSA is so easy, then it's low-hanging fruit, so 
> whatever, go ahead and push for it.
>
> === REMINDER OF RELEVANT STUFF FURTHER UP IN THE POST:
>
> "A more objective question: which devices are you trying to 
> target here?"
>
> === AND SOMETHING ELSE:
>
> I feel like we're just on different wavelengths. At what level 
> do you imagine having this support, in terms of support for 
> doing low-level things? Is this something like OpenMP, where 
> threading and such are done at a really (really really 
> really...) high level, or what?

Low-level optimization is a wonderful thing, but I almost wonder 
if this will always be something where, in order to do the 
low-level optimization, you will be using the vendor's provided 
platform for doing it, as no generic tool will be able to match 
the custom one.

Most of my interaction with the GPU is via shader programs for 
OpenGL; I have only lightly used CUDA for some image processing 
software, so I am certainly not the one to give in-depth detail 
on optimization strategies.

Sorry, on point 1 that was a typo; I meant:

1. The range must be known prior to execution of a GPU code block.

As for point 3:

3. Code blocks can only receive a single range; it can, however, 
be multidimensional.

int a[100] = [ ... ];
int b[100];
void example3( aggregate in range r ; k, in float a[], float b[] ) {
   b[k] = a[k];
}
example3( 0 .. 100, a, b );

This function would be executed 100 times.

int a[10_000] = [ ... ];
int b[10_000];
void example3( aggregate in range r ; kx, aggregate in range r2 ; ky,
               in float a[], float b[] ) {
   b[kx+(ky*100)] = a[kx+(ky*100)];
}
example3( 0 .. 100, 0 .. 100, a, b );

This function would be executed 10,000 times, with the two 
aggregate ranges being treated as a single two-dimensional range.

Maybe a better description of the rule would be that multiple 
ranges are multiplicative, and functionally operate as a single 
range.
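
In present-day D the multiplicative behaviour could be emulated 
with plain nested loops over a flattened index; illustrative only, 
and the names here are made up:

// Emulates the hypothetical two-range example3 above: dimX * dimY
// iterations (10,000 for 100 x 100), writing through one flat index.
void example3Emulated(const float[] a, float[] b, size_t dimX, size_t dimY)
{
    foreach (ky; 0 .. dimY)
        foreach (kx; 0 .. dimX)
            b[kx + ky * dimX] = a[kx + ky * dimX];
}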

