GPGPUs
luminousone
rd.hunt at gmail.com
Sun Aug 18 01:08:33 PDT 2013
On Sunday, 18 August 2013 at 07:28:02 UTC, Atash wrote:
> On Sunday, 18 August 2013 at 06:22:30 UTC, luminousone wrote:
>> The Xeon Phi is interesting insofar as it takes generic
>> programming to a more parallel environment. However, it has
>> some serious limitations that will heavily hurt its potential
>> performance.
>>
>> AVX2 is completely the wrong path for improving performance
>> in parallel computing. The SIMD nature of the instruction set
>> means that scalar operations, or even just not being able to
>> fill the giant 256/512-bit registers, waste huge chunks of the
>> chip's peak theoretical performance, and if any
>> instruction-pairing rules apply to this multi-issue pipeline,
>> there is yet more potential for wasted cycles.
>>
>> I haven't seen anything about Intel's micro-thread scheduler,
>> or about how these chips handle the mass context switching
>> natural to micro-threaded environments. These two items make a
>> huge difference in performance; comparing Radeon VLIW5/4 to
>> Radeon GCN is a good example, where most of GCN's performance
>> benefit comes from the ease of scheduling scalar pipelines over
>> more complex pipes with instruction-pairing rules and the like.
>>
>> Frankly, Intel has some cool stuff, but they have been caught
>> with their pants down: they have depended on their large fab
>> advantage to carry them and got lazy.
>>
>> We are likely watching AMD64 all over again.
>
> Well, I can't argue that one.
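
To put rough numbers on the register-fill point: a 512-bit unit
holds 16 single-precision floats, so a purely scalar op runs at
1/16 of that unit's peak. A back-of-envelope sketch in D (the
numbers, not the mechanism, are the point; lane counts assume
single precision):

import std.stdio;

// Fraction of peak SIMD throughput when an n-element loop is
// processed with w-wide vector instructions plus a scalar
// remainder loop (one useful lane per scalar op).
double utilization(int n, int w)
{
    immutable vectorOps = n / w;       // full-width instructions
    immutable scalarOps = n % w;       // tail, one lane busy each
    immutable totalOps  = vectorOps + scalarOps;
    return (cast(double) n) / (totalOps * w);
}

void main()
{
    foreach (w; [4, 8, 16])            // SSE, AVX2, AVX-512 float lanes
        foreach (n; [5, 100, 1000])
            writefln("width %2d, n = %4d: %5.1f%% of peak",
                     w, n, 100.0 * utilization(n, w));
}

Short or awkwardly sized loops never get anywhere near the
headline FLOPS figure.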
>
>> At first, simply a different way of approaching
>> std.parallelism-like functionality, with an eye toward GPGPU in
>> the future, when easy integration solutions pop up (such as
>> HSA).
>
> I can't argue with that either.
>
>> It would be best to wait for a more generic software platform
>> to find out how this is handled by the next generation of
>> micro-threading tools.
>>
>> The way OpenCL/CUDA work reminds me too much of someone
>> setting up Tomcat to have Java code generate PHP that runs on
>> their Apache server, just because they can. I would rather have
>> tighter integration with the core language than a language
>> inside a language.
>
> Fair point. I have my own share of idyllic wants, so I can't
> argue with those.
>
>> Low-level optimization is a wonderful thing, but I almost
>> wonder if this will always be something where, in order to do
>> the low-level optimization, you will be using the vendor's
>> provided platform, as no generic tool will be able to match the
>> custom one.
>
> But OpenCL is by no means a 'custom tool'. CUDA, maybe, but
> OpenCL just doesn't fit the bill in my opinion. I can see it
> being possible in the future that it'd be considered
> 'low-level', but it's a fairly generic solution. A little
> hackneyed under your earlier metaphors, but still a generic,
> standard solution.
I can agree with that.
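
For anyone following along, this is the style I mean by a
language inside a language: with OpenCL the device code rides
through the host program as a string and is compiled by the
driver at run time. A minimal sketch in D; the host-side calls
(clCreateProgramWithSource, clBuildProgram,
clEnqueueNDRangeKernel) are left out:

// OpenCL C source, embedded in the D host program as plain text.
// The driver compiles it at run time; the D compiler never sees
// inside the string.
enum copyKernel = `
    __kernel void copy(__global const float* a, __global float* b)
    {
        size_t k = get_global_id(0); // this work-item's index
        b[k] = a[k];
    }
`;

None of that kernel is visible to the D compiler, which is the
disconnect I am complaining about.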
>> Most of my interaction with the GPU is via shader programs
>> for OpenGL. I have only lightly used CUDA for some
>> image-processing software, so I am certainly not the one to
>> give in-depth detail on optimization strategies.
>
> There was a *lot* of stuff that opened up when vendors dumped
> GPGPU out of Pandora's box. If you want to get a feel for some
> optimization strategies and what they require, check this site
> out: http://www.bealto.com/gpu-sorting_intro.html (and I hope
> I'm not insulting your intelligence here, if I am, I truly
> apologize).
I am still learning, and additional links to go over never hurt!
I am of the opinion that a good programmer has never finished
learning new stuff.
>> Sorry, on point 1 that was a typo. I meant:
>>
>> 1. The range must be known prior to execution of a GPU code
>> block.
>>
>> As for:
>>
>> 3. Code blocks can only receive a single range; it can,
>> however, be multidimensional.
>>
>> float[100] a = [ ... ];
>> float[100] b;
>> void example3( aggregate in range r ; k, in float[] a, float[] b ) {
>>     b[k] = a[k];
>> }
>> example3( 0 .. 100, a, b );
>>
>> This function would be executed 100 times, once per value of k.
>>
>> float[10_000] a = [ ... ];
>> float[10_000] b;
>> void example3( aggregate in range r ; kx, aggregate in range r2 ; ky,
>>                in float[] a, float[] b ) {
>>     b[kx + (ky * 100)] = a[kx + (ky * 100)];
>> }
>> example3( 0 .. 100, 0 .. 100, a, b );
>>
>> This function would be executed 10,000 times, the two
>> aggregate ranges being treated as a single two-dimensional
>> range.
>>
>> Maybe a better description of the rule would be that multiple
>> ranges are multiplicative, and functionally operate as a
>> single range.
>
> OH.
>
> I think I was totally misunderstanding you earlier. The
> 'aggregate' is the range over the *problem space*, not the
> values being punched into the problem. Is this true or false?
>
> (if true I'm about to feel incredibly sheepish)
Is "problem space" the correct industry term, I am self taught on
much of this, so I on occasion I miss out on what the correct
terminology for something is.
But yes that is what i meant.
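
To make the two-dimensional case above concrete: the same
10,000-item problem space can be written today with
std.parallelism, which at least shows the index space, even
though it runs on CPU threads rather than a GPU. A rough sketch,
not the proposed syntax:

import std.parallelism : parallel;
import std.range : iota;

// Every (kx, ky) pair in 0..100 x 0..100 is one independent
// work-item, 10,000 in total; the aggregate-range proposal would
// hand this same index space to a GPU dispatch instead.
void example3(const float[] a, float[] b)
{
    foreach (ky; parallel(iota(100)))   // outer dimension
        foreach (kx; 0 .. 100)          // inner dimension
            b[kx + ky * 100] = a[kx + ky * 100];
}

void main()
{
    auto a = new float[](10_000);
    auto b = new float[](10_000);
    a[] = 1.0f;
    example3(a, b);
}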