D and multicore

Fawzi Mohamed fawzi at gmx.ch
Sat Nov 13 17:44:08 PST 2010


On 13-nov-10, at 22:23, Gary Whatmore wrote:

> parallel noob Wrote:
>
>> Hello
>>
>> Intro: people with pseudonyms are often considered trolls here, but  
>> this is a really honest question by a sw engineer now writing  
>> mostly sequential web applications.  (I write "parallel" web apps,  
>> but the logic goes that you write sequential applications for each  
>> http query, the frontend distributes queries among backend http  
>> processes, and the database "magically" ensures proper locking.  
>> There's hardly ever any locks in my code.)
>>
>> D is touted as the next gen of multicore languages. I pondered  
>> between D and D.learn, where to ask this. It just strikes me odd  
>> that there isn't any kind of documentation explaining how I should  
>> write (parallel) code for multicore in D. If D is much different,  
>> the general guidelines for PHP web applications, Java, or Erlang  
>> might not work. From what I've gathered from these discussions,  
>> there are:
>>
>> - array operations and auto-parallelization of loops
>> - mmx/sse intrinsics via library
>> - transactional memory (requires hardware support? doesn't work?)
>> - "erlang style" concurrency? == process functions in Phobos 2?
>> - threads, locks, and synchronization primitives
>>
>> Sean, sybrandy, don, fawzi, tobias, gary, dsimcha, bearophile,  
>> russel, trass3r, dennis, and simen clearly have ideas how to work  
>> with parallel problems.
>>
>> A quick look at wikipedia gave http://en.wikipedia.org/wiki/Parallel_computing 
>>  and http://en.wikipedia.org/wiki/Parallel_programming_model
>>
>> I fail to map these concepts discussed here with the things listed  
>> on those pages. I found MPI, POSIX Threads, TBB, Erlang, OpenMP,  
>> and OpenCL there.
>>
>> Sean mentioned:
>>
>> "In the long term there may turn out to be better models, but I  
>> don't know of one today."
>>
>> So he's basically saying that those others listed in the wikipedia  
>> pages are totally unsuitable for real world tasks? Only Erlang  
>> style message passing works?
>>
>> The next machine I buy comes with 12 or 16 cores or even more --  
>> this one has 4 cores. The typical applications I use take advantage  
>> of 1-2 threads. For example a cd ripper starts a new process for  
>> each mp3 encoder. The program runs at most 3 threads (the gui, the  
>> mp3 encoder, the cd ripper). More and more applications run in the  
>> browser. The browser actively uses one thread + one thread per  
>> applet. I can't even utilize more than 50% of the power of the  
>> current gen!
>>
>> The situation is different with GPUs. My Radeon 5970 has 3200  
>> cores. When the core count doubles, the FPS rating in games almost  
>> doubles. They definitely are not running Erlang style processes  
>> (one for GUI, one for sounds, one for physics, one for network).  
>> That would leave 3150 cores unused.

There are different kinds of parallel problems: some are trivially
(or almost trivially) parallel, others much less so. Some tasks are
very fine-grained (one talks of micro-parallelism), others are much
coarser.

Typical code has limited parallelization potential. The out-of-order
execution of modern processors tries to take advantage of it, but
having a lot of execution hardware is normally not useful because the
amount of instruction-level parallelism (ILP) is limited.
There is an important exception: vector operations. Processors often
have vector hardware to execute them efficiently, and compilers
vectorize loops to take advantage of it.
Array operations are a class of operations (which includes vector
operations) that are often very parallel. If, for example, one wants
to apply a pure operation to each element of an array, that is
trivially parallel.
Data-parallel languages are especially good at expressing this kind
of parallelism.
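As a concrete sketch in D, assuming a data-parallel foreach like the
one in the std.parallelism task pool module (dsimcha's work) is
available:

import std.math : sqrt;
import std.parallelism : parallel;

void main()
{
    auto data = new double[](1_000_000);
    foreach (i, ref x; data)
        x = i;

    // each element is updated by a pure, independent operation,
    // so the loop body can be distributed over all cores
    foreach (ref x; parallel(data))
        x = sqrt(x);
}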

GPUs are optimized for graphical operations, which are mainly vector
and array operations and thus contain a large amount of this kind of
parallelism. The same kind of parallelism is present in some
scientific programs, and indeed GPUs (with CUDA or OpenCL) are
increasingly used for them.

The coarser levels of parallelization use other means.
In my opinion shared-memory parallelization can be done efficiently
if one is able to handle independent and recursive tasks.
Recursive tasks (which come, for example, from divide & conquer
approaches, and can among other things be used to perform array
operations) can be evaluated efficiently by evaluating subtasks first
and stealing supertasks, taking the locality of the processors into
account (Cilk has such an approach).
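A sketch of such a divide & conquer task in D (again assuming a
std.parallelism-like task pool; the cutoff value is arbitrary):

import std.parallelism : task, taskPool;

// recursive parallel sum: split the array, submit one half as a task
// that idle workers can steal, and recurse on the other half
double parallelSum(double[] a)
{
    if (a.length < 10_000)          // small enough: just do it serially
    {
        double s = 0;
        foreach (x; a) s += x;
        return s;
    }
    auto mid  = a.length / 2;
    auto left = task!parallelSum(a[0 .. mid]);
    taskPool.put(left);             // make the left half stealable
    auto rightSum = parallelSum(a[mid .. $]);
    return left.yieldForce + rightSum;  // wait for (or run) the left half
}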

Independent tasks can be represented by threads, should be scheduled
fairly, and work well to represent a single interacting object or the
different requests of a web server.
All operating systems support this. Since threads (unlike processes)
share memory, one has to take care that changes made by one thread
are seen by the others in a meaningful way. For this there are locks,
atomic operations, ...
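A small illustration in D2 using druntime's core.thread and
core.atomic (just a sketch; a mutex and synchronized would work as
well):

import core.atomic : atomicOp;
import core.thread : Thread;

shared long counter;                 // visible to all threads

void worker()
{
    foreach (i; 0 .. 100_000)
        atomicOp!"+="(counter, 1);   // atomic read-modify-write, no lock
}

void main()
{
    auto threads = new Thread[](4);
    foreach (ref t; threads)
        (t = new Thread(&worker)).start();
    foreach (t; threads)
        t.join();
    assert(counter == 4 * 100_000);  // no updates were lost
}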
Transactional memory works for groups of changes that should appear
atomic (the big problem is that if a transaction conflicts one has to
undo everything and retry, something that becomes more and more
likely the longer the transaction is).

What I tried to achieve with blip.parallel.smp is to accommodate
both kinds of parallelism reasonably well.
Several higher-level abstractions can then be built on top of this
framework.
I feel it is important to handle this centrally, because a processor
is used optimally when each core has one thread.
Actually, to hide the latency of operations that would otherwise make
a core waste cycles, some processors keep two threads active and
switch quickly from one to the other when one stalls. This is called
hyperthreading, and while it doesn't improve the performance of a
single thread (it is actually slightly worse), it improves the
throughput and the usage of the execution units of a single core.

I think that almost all shared-memory parallelization approaches can
be implemented reasonably efficiently on top of tasks, possibly using
some locality hints for the initial distribution.

PGAS (partitioned global address space) tries to give one a global
view while still having local storage: each process can access both
its local and the remote memory simply by indexing. The conceptual
advantage is that one can easily migrate to it starting from a local
view. The hope is that the memory layout and the access patterns can
then be optimized later, at least partially independently of the
algorithm used. At least in some cases this is possible.

In distributed-memory approaches each process has its own memory
space and cannot directly access the memory of other processes.
This is potentially more complex than PGAS/shared-memory approaches,
because one has to explicitly transfer data between processes to
communicate; normally one uses *messages* for this.
The big advantage is that the programmer is forced to think
explicitly about the potentially costly communication operations
(even a latency of a few ms is on the order of 10^6 cycles of a
typical processor) and can possibly optimize them better.

Distributed memory scales very well, as one can easily build large
clusters of computers.
MPI (the Message Passing Interface) uses this approach, but MPI is
really more about having collective communication patterns
implemented efficiently and being able to create optimal subsets of
processes for subtasks. MPI is the correct choice if you have a
complex problem (also with respect to its parallelization) and you
want to use *all* available resources efficiently.

Sometimes the problem you have is not so costly that you need to
commit all resources to it; you just want to solve it efficiently,
taking advantage of parallelization if possible.
In this case a good model is the actor model, where objects
communicate with each other through messages.
One can have thread-like objects with mailboxes and pattern matching
to select the message, or objects with an interface and remote
procedure calls to invoke them.
You can organize the network of messages in several ways: a central
server with clients that connect to it, a central database used to
communicate, a peer-to-peer structure, or producer/consumer
relationships.
Normally, given a problem, one can see how to partition it well.
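A small sketch of this style with the message passing in Phobos 2
(std.concurrency, the "erlang style" concurrency mentioned above);
the worker reads its mailbox with receive, which pattern-matches on
the type of the message:

import std.concurrency : spawn, send, receive, ownerTid, Tid;
import std.stdio : writeln;

// the worker squares every int it gets and stops on any string
void worker()
{
    bool done = false;
    while (!done)
    {
        receive(
            (int n)    { send(ownerTid, n * n); },
            (string s) { done = true; }
        );
    }
}

void main()
{
    Tid w = spawn(&worker);
    send(w, 21);
    receive((int result) { writeln("got ", result); });  // "got 441"
    send(w, "stop");
}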

I use blip.parallel.rpc to provide that kind of messaging between
objects.
Note that in this model one has to think about the failure of one
part: a failure of one process should not necessarily stop all
processes (in some cases it might even go undetected).
At this level one could theoretically migrate processes/objects
automatically, but given that the latency increase can be very large
(~10^6), such automatic distribution is feasible only for tasks the
programmer designed with that in mind; a fully automatic
redistribution of arbitrary objects is not realistic.

I hope this overview helps a bit
Fawzi

