GPGPUs
Atash
nope at nope.nope
Sun Aug 18 00:28:01 PDT 2013
On Sunday, 18 August 2013 at 06:22:30 UTC, luminousone wrote:
> The Xeon Phi is interesting in so far as taking generic
> programming to a more parallel environment. However it has some
> serious limitations that will heavily damage its potential
> performance.
>
> AVX2 is completely the wrong path for improving performance in
> parallel computing. The SIMD nature of the instruction set means
> that scalar operations, or even just failing to fill the giant
> 256/512-bit registers, waste huge chunks of this thing's peak
> theoretical performance, and if any rules apply to instruction
> pairing on this multi-issue pipeline, you have yet more potential
> for wasted cycles.
>
> I haven't seen anything about Intel's micro-thread scheduler, or
> how these chips handle the mass context switching natural to
> micro-threaded environments. These two items make a huge
> difference in performance; comparing Radeon VLIW5/4 to Radeon
> GCN is a good example. Most of the performance benefit of GCN
> comes from the ease of scheduling scalar pipelines over more
> complex pipes with instruction-pairing rules, etc.
>
> Frankly, Intel has some cool stuff, but they have been caught
> with their pants down; they have depended on their large fab
> advantage to carry them and have gotten lazy.
>
> We likely are watching AMD64 all over again.
Well, I can't argue that one.
> At first, simply a different way of approaching
> std.parallelism-like functionality, with an eye toward GPGPU in
> the future when easy integration solutions pop up (such as HSA).
I can't argue with that either.
> It would be best to wait for a more generic software platform,
> to find out how this is handled by the next generation of
> micro-threading tools.
>
> The way OpenCL/CUDA work reminds me too much of someone setting
> up Tomcat to have Java code generate PHP that runs on their
> Apache server, just because they can. I would rather have tighter
> integration with the core language than a language within a
> language.
Fair point. I have my own share of idealistic wants, so I can't
argue with those.
> Low-level optimization is a wonderful thing, but I almost wonder
> whether, in order to do the low-level optimization, you will
> always have to use the vendor's provided platform, as no generic
> tool will be able to match the custom one.
But OpenCL is by no means a 'custom tool'. CUDA, maybe, but
OpenCL just doesn't fit the bill in my opinion. I can see it
being possible in the future that it'd be considered 'low-level',
but it's a fairly generic solution. A little hackneyed under your
earlier metaphors, but still a generic, standard solution.
> Most of my interaction with the GPU is via shader programs for
> OpenGL. I have only lightly used CUDA for some image-processing
> software, so I am certainly not the one to give in-depth detail
> on optimization strategies.
There was a *lot* of stuff that opened up when vendors dumped
GPGPU out of Pandora's box. If you want to get a feel for some
optimization strategies and what they require, check this site
out: http://www.bealto.com/gpu-sorting_intro.html (and I hope I'm
not insulting your intelligence here, if I am, I truly apologize).
> sorry on point 1, that was a typo, I meant
>
> 1. The range must be known prior to execution of a gpu code
> block.
>
> as for
>
> 3. Code blocks can only receive a single range, it can however
> be multidimensional
>
> float a[100] = [ ... ];
> float b[100];
> void example3( aggregate in range r ; k, in float a[], float b[] ) {
>     b[k] = a[k];
> }
> example3( 0 .. 100, a, b );
>
> This function would be executed 100 times.
>
> float a[10_000] = [ ... ];
> float b[10_000];
> void example3( aggregate in range r ; kx, aggregate in range r2 ; ky,
>                in float a[], float b[] ) {
>     b[kx + (ky * 100)] = a[kx + (ky * 100)];
> }
> example3( 0 .. 100, 0 .. 100, a, b );
>
> This function would be executed 10,000 times, the two aggregate
> ranges being treated as a single two-dimensional range.
>
> Maybe a better description of the rule would be that multiple
> ranges are multiplicative, and functionally operate as a single
> range.
OH.
I think I was totally misunderstanding you earlier. The
'aggregate' is the range over the *problem space*, not the values
being punched into the problem. Is this true or false?
(if true I'm about to feel incredibly sheepish)
More information about the Digitalmars-d mailing list