Good demo for showing benefits of parallelism

Sat Jan 27 01:48:58 PST 2007

Jascha Wetzel wrote:
> Kevin Bealer wrote:
>> Do you know if there was much lock contention?
> 
> there is no explicit synchronization at all. the nice thing about
> raytracing and similar algorithms is, that you can use disjoint memory
> and let the threads work isolated from each other. except for the main
> thread that waits for the workers to terminate, there are no mutexes
> involved.

Hmmm... chunk size can be an issue too, whether the chunks are too small 
or too large.  If you divide the workload into very small chunks, such as 
individual pixels, you pay overhead for switching between them and you 
lose locality of reference.

On the other hand, if you have ten threads, I think it is a mistake to 
divide the pixel load into exactly ten parts -- some pixels are heavier 
than others.  The pixels that raytrace complex refractions will take 
longer and so some threads finish first.

What I would tend to do is divide the pixel count into blocks, then let 
each thread pull a block of (consecutive or adjacent if possible) pixels 
to work on.  Then some threads can do hard blocks and take a while at 
it, and others can do simple blocks and process more of them.
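As a rough sketch of the block-pulling idea (in Python only for brevity; 
the names render_blocks and shade are made up for illustration, and 
CPython threads will not actually speed up CPU-bound work, so this shows 
the scheduling pattern rather than a real speedup):

```python
import threading

def render_blocks(num_pixels, block_size, num_threads, shade):
    """Fill an image by letting threads pull consecutive pixel blocks.

    `shade` is a hypothetical per-pixel function standing in for the
    raytracer; it takes a pixel index and returns a value.
    """
    image = [None] * num_pixels
    next_start = 0                   # first pixel of the next unclaimed block
    lock = threading.Lock()          # the only synchronization point
    blocks_done = [0] * num_threads  # how many blocks each thread processed

    def worker(tid):
        nonlocal next_start
        while True:
            with lock:               # claim a block; the lock is held briefly
                start = next_start
                if start >= num_pixels:
                    return           # nothing left to hand out
                next_start = min(start + block_size, num_pixels)
                end = next_start
            for i in range(start, end):
                image[i] = shade(i)  # disjoint range, so no lock needed here
            blocks_done[tid] += 1

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return image, blocks_done
```

A thread that lands on cheap blocks simply comes back for more, so the 
per-thread counts in blocks_done will usually differ even though every 
thread stays busy until the work runs out.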

The goal is for all the threads to finish at about the same time.  If 
they don't, you end up with some threads waiting idly at the end.

This would require reintroducing a little synchronization.  I might 
divide an image into blocks that are each about 2% of the total image.  
Let's say you have 4 hardware threads and each block is 2% of the 
workload.  Then the average inefficiency is maybe something like 
1/2 * .02 * 4 = 4%.  This is 1/2 of the threads being idle on average, 
as they finish at randomly uneven rates, times 2% of the total work per 
block, times 4 hardware threads (because it doesn't hurt much to have 
idle *software* threads, only idle hardware ones).
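To make the back-of-envelope estimate easy to replay with other numbers, 
here is a tiny sketch.  The function name is made up, and the comment 
reflects one way to read the formula (about half a block of idle time at 
the end, measured against each hardware thread's 1/N share of the work), 
not a rigorous model:

```python
def estimated_idle_fraction(block_fraction, hardware_threads):
    """Rough estimate of time lost to threads idling at the end of a run.

    block_fraction: size of one block as a fraction of the total work.
    hardware_threads: number of hardware threads sharing the work.
    Assumes roughly half a block of idle time at the end, relative to a
    per-thread share of 1/hardware_threads of the total work.
    """
    return 0.5 * block_fraction * hardware_threads

# The case from the text: 2% blocks on 4 hardware threads.
print(estimated_idle_fraction(0.02, 4))  # 0.04, i.e. about 4%
```

By the same estimate, halving the block size halves the expected waste, 
at the price of more trips through the synchronized hand-out.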

Other micro-considerations:

1. Handing out 'harder' sections first is a good idea if you have this 
info, because these will "stick out" more if they are the only ones 
still running at the end (they represent a larger percentage of the run 
time than their block size suggests).

2. You can start by handing out large chunks and then switch to smaller 
chunks when the work is mostly done.  For example, for 4 hardware 
threads, you could always hand out 1/16th of the remaining work until 
you get down to handing out 100 pixels at a time.
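Both considerations can be sketched in a few lines.  This is only an 
illustration: hardest_first and guided_chunks are made-up names, the 
cost estimate is assumed to come from somewhere else, and the 1/16 and 
100-pixel figures are just the example numbers from point 2.

```python
def hardest_first(blocks, estimated_cost):
    """Order blocks so the ones expected to be most expensive go out first.

    `estimated_cost` is a hypothetical function mapping a block to a
    number; how to obtain such an estimate is outside this sketch.
    """
    return sorted(blocks, key=estimated_cost, reverse=True)

def guided_chunks(total_pixels, divisor=16, min_chunk=100):
    """Yield (start, end) pixel ranges of shrinking size.

    Each chunk is 1/divisor of the work still remaining, but never
    smaller than min_chunk (except for a final, shorter leftover).
    """
    start = 0
    while start < total_pixels:
        remaining = total_pixels - start
        size = max(remaining // divisor, min_chunk)
        size = min(size, remaining)  # do not run past the end of the image
        yield start, start + size
        start += size

# Example: the first few ranges for a 10000-pixel image.
for start, end in list(guided_chunks(10000))[:3]:
    print(start, end)
```

In a threaded version, each worker would take the next range from 
guided_chunks under a lock, much as it would take a fixed-size block.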

(I don't know how useful this is for raytracing specifically, but some 
of these issues come up where I work.)

Kevin