GPGPUs

Atash nope at nope.nope
Sun Aug 18 00:28:01 PDT 2013


On Sunday, 18 August 2013 at 06:22:30 UTC, luminousone wrote:
> The Xeon Phi is interesting insofar as it takes generic 
> programming into a more parallel environment. However, it has 
> some serious limitations that will heavily damage its potential 
> performance.
>
> AVX2 is completely the wrong path for improving performance in 
> parallel computing. The SIMD nature of the instruction set means 
> that scalar operations, or even just failing to fill the giant 
> 256/512-bit registers, waste huge chunks of the thing's peak 
> theoretical performance, and if any instruction-pairing rules 
> apply on this multi-issue pipeline you have yet more potential 
> for wasted cycles.
>
> I haven't seen anything about Intel's micro-thread scheduler, 
> or how these chips handle the mass context switching natural to 
> micro-threaded environments. These two items make a huge 
> difference in performance; comparing Radeon VLIW5/4 to Radeon 
> GCN is a good example. Most of the performance benefit of GCN 
> comes from the ease of scheduling scalar pipelines over more 
> complex pipes with instruction-pairing rules and the like.
>
> Frankly, Intel has some cool stuff, but they have been caught 
> with their pants down; they have depended on their large fab 
> advantage to carry them over and got lazy.
>
> We likely are watching AMD64 all over again.

Well, I can't argue that one.

> At first, simply a different way of approaching std.parallel-like 
> functionality, with an eye toward GPGPU in the future when easy 
> integration solutions pop up (such as HSA).

I can't argue with that either.

> It would be best to wait for a more generic software platform, 
> to find out how this is handled by the next generation of 
> micro-threading tools.
>
> The way OpenCL/CUDA work reminds me too much of someone setting 
> up Tomcat to have Java code generate PHP that runs on their 
> Apache server, just because they can. I would rather have 
> tighter integration with the core language than have a language 
> within a language.

Fair point. I have my own share of idyllic wants, so I can't 
argue with those.

> Low-level optimization is a wonderful thing, but I almost wonder 
> whether this will always be something where, in order to do the 
> low-level optimization, you will be using the vendor's provided 
> platform for doing it, as no generic tool will be able to match 
> the custom one.

But OpenCL is by no means a 'custom tool'. CUDA, maybe, but 
OpenCL just doesn't fit the bill in my opinion. I can see it 
being possible in the future that it'd be considered 'low-level', 
but it's a fairly generic solution. A little hackneyed under your 
earlier metaphors, but still a generic, standard solution.

> Most of my interaction with the GPU is via shader programs for 
> OpenGL. I have only lightly used CUDA for some image-processing 
> software, so I am certainly not the one to give in-depth detail 
> on optimization strategies.

There was a *lot* of stuff that opened up when vendors dumped 
GPGPU out of Pandora's box. If you want to get a feel for some 
optimization strategies and what they require, check this site 
out: http://www.bealto.com/gpu-sorting_intro.html (and I hope I'm 
not insulting your intelligence here, if I am, I truly apologize).

> sorry on point 1, that was a typo, I meant
>
> 1. The range must be known prior to execution of a gpu code 
> block.
>
> as for
>
> 3. Code blocks can only receive a single range, it can however 
> be multidimensional
>
> float a[100] = [ ... ];
> float b[100];
> void example3( aggregate in range r ; k, 
>   in float a[], float b[] ) {
>   b[k] = a[k];
> }
> example3( 0 .. 100, a, b );
>
> This function would be executed 100 times.
>
> float a[10_000] = [ ... ];
> float b[10_000];
> void example3( aggregate in range r ; kx, 
>   aggregate in range r2 ; ky, in float a[], float b[] ) {
>   b[kx + ky*100] = a[kx + ky*100];
> }
> example3( 0 .. 100, 0 .. 100, a, b );
>
> This function would be executed 10,000 times, the two aggregate 
> ranges being treated as a single two-dimensional range.
>
> Maybe a better description of the rule would be that multiple 
> ranges are multiplicative and functionally operate as a single 
> range.

OH.

I think I was totally misunderstanding you earlier. The 
'aggregate' is the range over the *problem space*, not the values 
being punched into the problem. Is this true or false?

(if true I'm about to feel incredibly sheepish)


More information about the Digitalmars-d mailing list