Good demo for showing benefits of parallelism

Kevin Bealer kevinbealer at gmail.com
Fri Jan 26 19:11:36 PST 2007


Jascha Wetzel wrote:
> i had a simple, unoptimized raytracer from an old CG assignment lying
> around that i ported to D and modified for parallel computation.
> download here: http://mainia.de/prt.zip
> 
> on a single core/cpu machine, the time stays about the same until 4
> threads, then the overhead kicks in.
> here are some values from an opteron dual core system running linux
> kernel 2.6 for different thread counts:
> thrds seconds
> 1     32.123
> 2     32.182
> 3     29.329
> 4     28.556
> 8     21.661
> 16    20.186
> 24    20.423
> 32    21.410
> 
> these aren't quite what i expected. CPU usage shows that both cores get
about 55-80% load with 2 threads, stabilizing at 65-75% with 16
> threads. with a single thread it's clearly 100% on one core.
> 
> am i missing something about the pthread lib or memory usage/sync that
> prevents reasonable speedup? 160% is nice, but why does it take 16
> threads to get there? and where exactly do the remaining 40% go?
> 
> Bill Baxter wrote:
>> I don't remember where it was, but somebody in some thread said
>> something like gee it would be great if we had a demo that showed the
>> benefits of having parallel threads.  (It was probably in either the
>> Futures lib thread or in the discussion about varargs_reduce).
>>
>> Anyway, a raytracer is a perfect example of "embarrassingly
>> parallelizable" code.  In the simplest case you can state it simply as
>> "for each pixel do trace_ray(ray_dir)".
>>
>> Bradley Smith posted code for a little raytracer over on D.learn under
>> the heading "Why is this code slower than C++".  If anyone is interested
>> in turning this into a parallel raytracer, that would be a nice little demo.
>>
>> Along the same lines, almost any image processing algorithm is in the
>> same category where the top loop looks like "foreach pixel do ___". This
>> is a big part of why GPUs just keep getting faster and faster, because
>> the basic problem is just so inherently parallelizable.
>>
>> --bb
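
Right -- a per-pixel loop like that splits across threads with almost 
no shared state.  Just to sketch the idea (trace_ray and the image 
layout here are stand-ins, not the code from Jascha's zip):

#include <pthread.h>

const int WIDTH = 640, HEIGHT = 480, NTHREADS = 4;
float image[HEIGHT][WIDTH];

// Stand-in for the real per-pixel work.
float trace_ray(int x, int y) { return float(x ^ y); }

struct Slice { int row_begin, row_end; };

void * render_slice(void * arg)
{
    Slice * s = (Slice *) arg;

    // Each thread owns a band of rows, so the inner loop needs no locks.
    for (int y = s->row_begin; y < s->row_end; y++)
        for (int x = 0; x < WIDTH; x++)
            image[y][x] = trace_ray(x, y);

    return 0;
}

int main()
{
    pthread_t thr[NTHREADS];
    Slice slice[NTHREADS];

    int rows = HEIGHT / NTHREADS;

    for (int i = 0; i < NTHREADS; i++) {
        slice[i].row_begin = i * rows;
        slice[i].row_end   = (i == NTHREADS-1) ? HEIGHT : (i+1) * rows;
        pthread_create(&thr[i], 0, render_slice, &slice[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(thr[i], 0);

    return 0;
}

If the work is divided that way -- each thread touching a disjoint set 
of rows -- there is nothing to contend over, which is why the numbers 
above are surprising.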

Do you know if there was much lock contention?  If there was a lot, you 
can lose efficiency because the threads bump into each other like a 
traffic jam.  Failing to get a lock (pausing and waking up again when it 
frees up) is a lot slower than acquiring it on the first try.
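
If you want to measure it, one cheap trick is to count how often the 
lock is already taken when you ask for it.  Something like this (the 
counters are just illustrative):

#include <pthread.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
long total = 0, contended = 0;  // both updated under the mutex

void locked_work()
{
    // trylock succeeds only if the mutex is free; failing here means
    // we are about to block behind another thread.
    if (pthread_mutex_trylock(&mutex) != 0) {
        pthread_mutex_lock(&mutex);
        contended++;
    }
    total++;

    // ... critical section ...

    pthread_mutex_unlock(&mutex);
}

If contended/total comes out high, the threads are spending their time 
queued up rather than tracing rays.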

There was some code I had at work where I did something like this:

Mutex.lock();

     int length = (end-begin)*4;

     char * p = (finds a byte location in an mmapped file);

     int remainder = *p;  // A
     length += remainder; // A

Mutex.unlock();

I was able to improve performance measurably by moving the unlock() to a 
point before the lines marked "A".
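
In other words, it ended up looking more like this (safe in that code 
because the mapped bytes never changed once written -- you would have 
to check that in your own case):

Mutex.lock();

     int length = (end-begin)*4;

     char * p = (finds a byte location in an mmapped file);

Mutex.unlock();

int remainder = *p;  // A - may still soft-fault, but no lock is held
length += remainder; // A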

It turned out the *p read was (often) causing a soft page fault, which 
stalled the thread, which in turn made other threads pile up on the mutex.

If you can minimize your critical sections, you can reduce contention. 
The ideal critical section just swaps a couple of pointers or does 
something else that can't take much time.  If a memory read can trigger 
a page fault, you can sometimes do the read *before* the critical 
section, then take the lock and re-read the 'correct' value.  This way 
the memory page is known to be warm by the time you are inside the 
lock, even if the actual value has changed in the meantime.
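
As a sketch (the volatile read is only there to keep the compiler from 
discarding the pre-touch; the names are made up):

#include <pthread.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

int read_shared(char * p, int length)
{
    // Touch the page before taking the lock, so a soft page fault
    // stalls only this thread, not everyone queued behind the mutex.
    volatile char warm = *p;  // value may be stale; we only want the fault
    (void) warm;

    pthread_mutex_lock(&mutex);

    int remainder = *p;       // page is resident now; read the real value
    length += remainder;

    pthread_mutex_unlock(&mutex);

    return length;
}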

[ But this kind of fiddling around is pretty low level, so it won't have 
any effect most of the time -- make sure to have the right problem 
before fixing it. ]

Kevin


