Thoughts on parallel programming?

retard re at tard.com.invalid
Thu Nov 11 12:01:09 PST 2010


Thu, 11 Nov 2010 19:41:56 +0000, Russel Winder wrote:

> On Thu, 2010-11-11 at 15:16 +0100, Fawzi Mohamed wrote: [ . . . ]
>> on this I am not so sure, heterogeneous clusters are more difficult to
>> program, and GPU & co are slowly becoming more and more general
>> purpose. Being able to take advantage of those is useful, but I am not
>> convinced they are necessarily the future.
> 
> The Intel roadmap is for processor chips that have a number of cores
> with different architectures.  Heterogeneity is not going to be a
> choice, it is going to be an imposition.  And this is at bus level, not
> at cluster level.
> 
> [ . . . ]
>> yes many core is the future I agree on this, and also that distributed
>> approach is the only way to scale to a really large number of
>> processors.
>> But distributed systems *are* more complex, so I think that for the
>> foreseeable future one will have a hybrid approach.
> 
> Hybrid is what I am saying is the future whether we like it or not.  SMP
> as the whole system is the past.
> 
> I disagree that distributed systems are more complex per se.  I suspect
> comments are getting so general here that anything anyone writes can be
> seen as both true and false simultaneously.  My perception is that
> shared memory multithreading is less and less a tool that applications
> programmers should be thinking in terms of.  Multiple processes with a
> hierarchy of communications costs is the overarching architecture, with
> each process potentially being SMP or CSP or . . .
> 
>> again not sure the situation is as dire as you paint it, Linux does
>> quite well in the HPC field... but I agree that to be the ideal OS for
>> these architectures it will need more changes.
> 
> The Linux driver architecture is already creaking at the seams; it
> implies a central, monolithic approach to the operating system.  This
> falls down in a multiprocessor shared memory context.  The fact that
> the Top 500 generally use Linux is because it is the least worst
> option.  M$, despite throwing large amounts of money at the problem,
> and indeed buying some very high profile names to try to do something
> about the lack of traction, has failed to make any headway in the HPC
> operating system stakes.  Do you want to have to run a virus checker on
> your HPC system?
> 
> My gut reaction is that we are going to see a rise of hypervisors as per
> Tilera chips, at least in the short to medium term, simply as a bridge
> from today's OSes to future ones.  My guess is that L4 microkernels
> and/or nanokernels, exokernels, etc. will find a central place in future
> systems.  The problem to be solved is ensuring that the appropriate ABI
> is available on the appropriate core at the appropriate time.  Mobility
> of ABI is the critical factor here.
> 
> [ . . . ]
>> Whole array operations are useful, and when possible one gains much
>> by using them; unfortunately not all problems can be reduced to a few
>> large array operations, which is why data parallel languages are not
>> the main type of language.
> 
> Agreed.  My point was that in 1960s code people explicitly handled array
> operations using do loops because they had to.  Nowadays such code is
> anathema to efficient execution.  My complaint here is that people have
> put effort into compiler technology instead of rewriting the codes in a
> better language and/or idiom.  Clearly whole array operations only apply
> to algorithms that involve arrays!
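
As a concrete (if trivial) illustration of the do-loop versus whole-array
point, here is a small C sketch.  The function names are just for
illustration, and the BLAS call stands in for any "whole array"
formulation (the header name can differ between BLAS implementations):

    /* y <- a*x + y, written two ways. */
    #include <cblas.h>      /* CBLAS interface; header name may vary */

    void axpy_loop(int n, double a, const double *x, double *y)
    {
        /* 1960s style: the programmer dictates the element order. */
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    void axpy_whole_array(int n, double a, const double *x, double *y)
    {
        /* Whole-array formulation: the library decides how to execute it
           (vectorised, threaded, blocked, ...). */
        cblas_daxpy(n, a, x, 1, y, 1);
    }

The second form leaves execution order, vectorisation and threading to
the library instead of baking them into the source.
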
> 
> [ . . . ]
>> well whole array operations are a generalization of the SPMD approach,
>> so in this sense you said that that kind of approach will have a future
>> (but with a more difficult optimization as the hardware is more
>> complex).
> 
> I guess this is where the PGAS people are challenging things.
> Applications can be couched in terms of array algorithms which can be
> scattered across distributed memory systems.  Inappropriate operations
> lead to huge inefficiencies, but handled correctly, code runs very fast.
> 
>> About MPI, I think that many don't see what MPI really does: MPI
>> offers a simplified parallel model.
>> The main weakness of this model is that it assumes some kind of
>> reliability, but then it offers a clear computational model with
>> processors ordered in a linear or higher dimensional structure and
>> efficient collective communication primitives.
>> Yes, MPI is not the right choice for all problems, but when usable it
>> is very powerful, often superior to the alternatives, and programming
>> with it is *simpler* than thinking about a generic distributed system.
>> So I think that for problems that are not trivially parallel or easily
>> parallelizable, MPI will remain the best choice.
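
If I understand the "processors ordered in a linear or higher dimensional
structure" point, it is the sort of thing below -- a minimal C sketch
using the standard MPI API, with the actual per-rank work reduced to a
placeholder:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int size;
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Arrange the ranks in a 2D grid: the "higher dimensional structure". */
        int dims[2] = {0, 0}, periods[2] = {0, 0}, coords[2], rank;
        MPI_Dims_create(size, 2, dims);
        MPI_Comm grid;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
        MPI_Comm_rank(grid, &rank);
        MPI_Cart_coords(grid, rank, 2, coords);

        /* Each rank computes a local contribution (placeholder work)... */
        double local = coords[0] + coords[1];

        /* ...and a collective primitive combines them efficiently. */
        double total;
        MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, grid);

        if (rank == 0)
            printf("%d x %d grid, total = %g\n", dims[0], dims[1], total);

        MPI_Finalize();
        return 0;
    }
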
> 
> I guess my main irritant with MPI is that I have to run the same
> executable on every node and, perhaps more importantly, the message
> passing structure is founded on Fortran primitive data types.  OK so you
> can hack up some element of abstraction so as to send complex messages,
> but it would be far better if the MPI standard provided better
> abstractions.
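
For reference, the usual way to "hack up some element of abstraction" is
a derived datatype; a rough C sketch, where the struct itself is just an
invented example:

    #include <mpi.h>
    #include <stddef.h>

    /* An invented application message: not a plain primitive type. */
    typedef struct {
        int    id;
        double payload[3];
    } Sample;

    /* Build an MPI datatype describing Sample so it can be sent as one unit.
       (For arrays of Sample you would also resize the type to account for
       trailing padding.) */
    MPI_Datatype make_sample_type(void)
    {
        int          lengths[2] = {1, 3};
        MPI_Aint     displs[2]  = {offsetof(Sample, id),
                                   offsetof(Sample, payload)};
        MPI_Datatype types[2]   = {MPI_INT, MPI_DOUBLE};
        MPI_Datatype sample_type;

        MPI_Type_create_struct(2, lengths, displs, types, &sample_type);
        MPI_Type_commit(&sample_type);
        return sample_type;   /* usable in MPI_Send/MPI_Recv/collectives */
    }

It works, but it is exactly the kind of boilerplate one would rather the
standard (or a language binding) generated.
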
> 
> [ . . . ]
>> It might be a personal thing, but I am kind of "suspicious" toward
>> PGAS; I find a generalized MPI model better than PGAS when you want to
>> have separated address spaces.
>> Using MPI one can define a PGAS-like object wrapping local storage with
>> an object that sends remote requests to access remote memory pieces.
>> This means having a local server where these wrapped objects can be
>> "published" and that can respond at any moment to external requests.  I
>> call this RPC (remote procedure call) and it can be realized easily on
>> top of MPI.
>> As not all objects are distributed, and in a complex program it does
>> not always make sense to distribute these objects on all processors or
>> none, I find the robust partitioning and collective communication
>> primitives of MPI superior to PGAS.  With enough effort you can
>> probably get everything from PGAS too, but then you lose all its
>> simplicity.
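
If I follow, the wrapped-object idea is roughly the C sketch below.  The
tags, the termination convention and the question of how the serving loop
overlaps with real work (normally a helper thread or periodic MPI_Iprobe
polling) are all simplified away, so treat it as an illustration rather
than how blip actually does it:

    #include <mpi.h>

    enum { TAG_GET = 1, TAG_REPLY = 2, TAG_STOP = 3 };

    /* Server side: answer requests for elements of the locally owned block. */
    void serve_local_block(const double *local, MPI_Comm comm)
    {
        for (;;) {
            long index;
            MPI_Status st;
            MPI_Recv(&index, 1, MPI_LONG, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     comm, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            MPI_Send(&local[index], 1, MPI_DOUBLE, st.MPI_SOURCE,
                     TAG_REPLY, comm);
        }
    }

    /* Client side: read one element owned by another rank (blocking round trip). */
    double remote_get(long index, int owner, MPI_Comm comm)
    {
        double value;
        MPI_Send(&index, 1, MPI_LONG, owner, TAG_GET, comm);
        MPI_Recv(&value, 1, MPI_DOUBLE, owner, TAG_REPLY, comm,
                 MPI_STATUS_IGNORE);
        return value;
    }
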
> 
> I think we are going to have to take this one off the list.  My summary
> is that MPI and PGAS solve different problems differently.  There are
> some problems that one can code up neatly in MPI and that are ugly in
> PGAS, but the converse is also true.
> 
> [ . . . ]
>> The situation is not so dire: some problems are trivially parallel or
>> can be solved with simple parallel patterns, and others don't need to
>> be solved in parallel, as the sequential solution is fast enough, but I
>> do agree that being able to develop parallel systems is increasingly
>> important.
>> In fact it is something that I like to do, and have thought about a
>> lot.  I did program parallel systems, and out of that experience I
>> tried to build something for writing parallel programs "the way it
>> should be", or at least the way I would like it to be ;)
> 
> The real question is whether future computers will run Word,
> OpenOffice.org, Excel, Powerpoint fast enough so that people don't
> complain.  Everything else is an HPC ghetto :-)
> 
>> The result is what I did with blip, http://dsource.org/projects/blip .
>> I don't think that (excluding some simple examples) fully automatic
>> (transparent) parallelization is really feasible. At some point being
>> parallel is more complex, and it puts an extra burden on the
>> programmer.
>> Still it is possible to have several levels of parallelization, and if
>> you program a fully parallel program it should still be possible to use
>> it relatively efficiently locally, but a local program will not
>> automatically become fully parallel.
> 
> At the heart of all this is that programmers are taught that an
> algorithm is a sequence of actions to achieve a goal.  Programmers are trained to
> think sequentially and this affects their coding.  This means that
> parallelism has to be expressed at a sufficiently high level that
> programmers can still reason about algorithms as sequential things.
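
That matches my limited experience: the constructs that stick are the
ones where the source still reads like the sequential algorithm.  A
trivial C/OpenMP illustration of what "sufficiently high level" can mean
(compile with e.g. -fopenmp; without it the pragma is simply ignored):

    /* Dot product: the body is written and reasoned about sequentially;
       one annotation states where the parallelism and the combination are. */
    double dot(int n, const double *x, const double *y)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += x[i] * y[i];
        return sum;
    }
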
> 
> [ . . . ]

FWIW, I'm not a parallel computing expert and have almost no experience 
with it outside basic parallel programming courses, but it seems to me 
that HPC clusters are a completely separate domain.

It used to be the case that *all* other systems were single core and only 
HPC clusters consisted of (hybrid multicore setups of) several nodes. Now 
both embedded and mainstream PCs are getting multiple cores in both CPU 
and GPU chips. Multi-socket setups are still rare. The growth rate maybe 
follows Moore's law, at least in GPUs; in CPUs the problems with 
programmability are slowing things down, and many laptops are still 
dual-core even though multiple cores are more energy efficient than 
higher clock speeds. (My home PC has 8 virtual cores in a single CPU.)

The HPC systems with hundreds of processors are definitely still 
important, but I see that 99.9(99)% of the market is in desktop and 
embedded systems. We need efficient ways to program multicore mobile 
phones, multicore laptops, multicore tablet devices and so on. These are 
all shared memory systems. I don't think MPI works well in shared memory 
systems. It's good to have an MPI-like system for D, but it cannot solve 
these problems.
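
To spell out what I mean by shared memory: the threads can simply look at
the same data, so the natural style is something like the (deliberately
simplified) POSIX threads sketch below rather than explicit messages.
Thread count, work split and error handling are all glossed over:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N        1000000

    static double data[N];            /* shared: every thread sees it directly */
    static double partial[NTHREADS];

    static void *sum_chunk(void *arg)
    {
        long t  = (long)(intptr_t)arg;          /* which thread am I */
        long lo = t * (N / NTHREADS);
        long hi = (t == NTHREADS - 1) ? N : lo + N / NTHREADS;
        double s = 0.0;
        for (long i = lo; i < hi; ++i)          /* no copies, no messages */
            s += data[i];
        partial[t] = s;
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NTHREADS];
        for (long i = 0; i < N; ++i)
            data[i] = 1.0;
        for (long t = 0; t < NTHREADS; ++t)
            pthread_create(&threads[t], NULL, sum_chunk, (void *)(intptr_t)t);
        double total = 0.0;
        for (long t = 0; t < NTHREADS; ++t) {
            pthread_join(threads[t], NULL);
            total += partial[t];
        }
        printf("sum = %g\n", total);
        return 0;
    }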

