GPGPUs

luminousone rd.hunt at gmail.com
Sun Aug 18 01:08:33 PDT 2013


On Sunday, 18 August 2013 at 07:28:02 UTC, Atash wrote:
> On Sunday, 18 August 2013 at 06:22:30 UTC, luminousone wrote:
>> The Xeon Phi is interesting in so far as taking generic 
>> programming to a more parallel environment. However it has 
>> some serious limitations that will heavily damage its 
>> potential performance.
>>
>> AVX2 is completely the wrong path to go about improving 
>> performance in parallel computing. The SIMD nature of this 
>> instruction set means that scalar operations, or even just not 
>> being able to fill the giant 256/512-bit registers, waste huge 
>> chunks of this thing's peak theoretical performance, and if any 
>> rules apply to instruction pairing on this multi-issue 
>> pipeline, you have yet more potential for wasted cycles.
>>
>> I haven't seen anything about Intel's micro-thread scheduler, 
>> or how these chips handle the mass context switching natural to 
>> micro-threaded environments. These two items make a huge 
>> difference in performance; comparing Radeon VLIW5/4 to Radeon 
>> GCN is a good example. Most of the performance benefit of GCN 
>> comes from the ease of scheduling scalar pipelines over more 
>> complex pipes with instruction pairing rules, etc.
>>
>> Frankly, Intel has some cool stuff, but they have been caught 
>> with their pants down; they have depended on their large fab 
>> advantage to carry them and got lazy.
>>
>> We likely are watching AMD64 all over again.
>
> Well, I can't argue that one.
>
>> At first, simply a different way of approaching std.parallel- 
>> like functionality, with an eye toward GPGPU in the future when 
>> easy integration solutions pop up (such as HSA).
>
> I can't argue with that either.
>
>> It would be best to wait for a more generic software platform 
>> to find out how this is handled by the next generation of 
>> micro-threading tools.
>>
>> The way OpenCL/CUDA work reminds me too much of someone setting 
>> up Tomcat to have Java code generate PHP that runs on their 
>> Apache server, just because they can. I would rather have 
>> tighter integration with the core language than a language 
>> within a language.
>
> Fair point. I have my own share of idyllic wants, so I can't 
> argue with those.
>
>> Low-level optimization is a wonderful thing, but I almost 
>> wonder if this will always be something where, in order to do 
>> the low-level optimization, you will be using the vendor's 
>> provided platform, as no generic tool will be able to match the 
>> custom one.
>
> But OpenCL is by no means a 'custom tool'. CUDA, maybe, but 
> OpenCL just doesn't fit the bill in my opinion. I can see it 
> being possible in the future that it'd be considered 
> 'low-level', but it's a fairly generic solution. A little 
> hackneyed under your earlier metaphors, but still a generic, 
> standard solution.

I can agree with that.

>> Most of my interaction with the GPU is via shader programs for 
>> OpenGL. I have only lightly used CUDA for some image 
>> processing software, so I am certainly not the one to give 
>> in-depth detail on optimization strategies.
>
> There was a *lot* of stuff that opened up when vendors dumped 
> GPGPU out of Pandora's box. If you want to get a feel for some 
> optimization strategies and what they require, check this site 
> out: http://www.bealto.com/gpu-sorting_intro.html (and I hope 
> I'm not insulting your intelligence here, if I am, I truly 
> apologize).

I am still learning, and additional links to go over never hurt! 
I am of the opinion that a good programmer has never finished 
learning new stuff.

>> sorry on point 1, that was a typo, I meant
>>
>> 1. The range must be known prior to execution of a gpu code 
>> block.
>>
>> as for
>>
>> 3. Code blocks can only receive a single range; it can, 
>> however, be multidimensional.
>>
>> float a[100] = [ ... ];
>> float b[100];
>> void example3( aggregate in range r ; k, in float a[], float b[] ){
>>     b[k] = a[k];
>> }
>> example3( 0 .. 100 , a, b );
>>
>> This function would be executed 100 times.
>>
>> float a[10_000] = [ ... ];
>> float b[10_000];
>> void example3( aggregate in range r ; kx, aggregate in range r2 ; ky,
>>                in float a[], float b[] ){
>>     b[kx+(ky*100)] = a[kx+(ky*100)];
>> }
>> example3( 0 .. 100 , 0 .. 100 , a, b );
>>
>> This function would be executed 10,000 times, the two 
>> aggregate ranges being treated as a single 2-dimensional range.
>>
>> Maybe a better description of the rule would be that multiple 
>> ranges are multiplicative, and functionally operate as a 
>> single range.
>
> OH.
>
> I think I was totally misunderstanding you earlier. The 
> 'aggregate' is the range over the *problem space*, not the 
> values being punched into the problem. Is this true or false?
>
> (if true I'm about to feel incredibly sheepish)

Is "problem space" the correct industry term? I am self-taught on 
much of this, so on occasion I miss out on what the correct 
terminology for something is.

But yes, that is what I meant.



