Parallelizing code -- design problem in networkish code

Charles Hixson charleshixsn at earthlink.net
Thu Dec 13 20:38:19 PST 2012


On 12/13/2012 07:30 PM, Ali Çehreli wrote:
> Parallelism, concurrency, and multi-threading in general are fascinating
> topics. I can't claim that I have extensive experience on these topics
> but I have written the following two chapters after studying the
> std.parallelism and std.concurrency modules:
>
> http://ddili.org/ders/d.en/parallelism.html
>
> http://ddili.org/ders/d.en/concurrency.html
>
> Still, when I tried some of those ideas, I did not always see the
> performance gains that I had expected, at least with a particular test
> program. After reading articles like the following, now I understand how,
> perhaps counter-intuitively, single-threading can be way faster than
> multi-threading, including the CAS-style lock-free multi-threading. The
> following has been posted to D forums recently:
>
> http://martinfowler.com/articles/lmax.html
>
> Sorry to ramble on about this topic. :) I am thinking out loud that next
> time I will try to make the actual processing of data single-threaded as
> I have learned from the Disruptor architecture and see whether that
> makes it faster.
>
> On 12/13/2012 01:32 PM, Charles Hixson wrote:
>  > I'm trying to parallelize some code which is essentially a network of
>  > cells with an index. I can make the cells immutable,
>
> Because you say "parallelize", I don't think you need to make your data
> immutable at all, because parallelism requires that the data is not shared.
>
> Even if you need to share the data, if you are doing message passing
> concurrency, then you can safely cast your immutable data to mutable as
> long as you know that there is only one thread that accesses it at a
> given time. Message-passing makes it easy to reason about.
>
> But I think 'shared' is likely better than 'immutable' in your case.
>
> Not knowing your exact situation, I would recommend the simplest option
> first, which is std.parallelism:
>
> import std.stdio;
> import std.parallelism;
>
> void main()
> {
>     auto array = [ 1, 2 ];
>
>     foreach (element; parallel(array)) {
>         // This block is executed on the elements in parallel
>         writeln(element);
>     }
> }
>
> std.parallelism works on ranges, which makes it possible to parallelize
> operations lazily, which enables making the objects as needed. Here is a
> thousand ints that are processed in parallel, which never need to live
> in an actual array:
>
> import std.stdio;
> import std.parallelism;
> import std.range;
>
> void main()
> {
>     auto numbers = iota(1000);
>
>     foreach (element; parallel(numbers)) {
>         writeln(element);
>     }
> }
>
> In case you are not already familiar, ranges are fundamentally a very
> simple concept. There is the following chapter on ranges:
>
> http://ddili.org/ders/d.en/ranges.html
>
> Ali
>
std.parallelism is indeed what I was originally thinking of using.  But 
it explicitly states that it doesn't support thread safety except on 
routines that are marked @safe or @trusted, and ParallelForeach isn't. 
It's true that the TaskPool constructors (TaskPool.this) are marked 
@trusted, but it's not clear to me that this extends to the indirect 
use in:
foreach (i, ref elem; taskPool.parallel(logs, 100))
If it does, then that's a good answer.

Now what I was thinking of involved an array in another class (i.e., not 
the Cell class) defined:
	Cell[]	cells;
and the Cell class, which includes:
public class Cell
{  ...
    Idno[]	ups;
    ...
}
where ups is an array of id numbers that index into cells.  While 
processing a cell, it's likely that ups will be used to index into 
cells to adjust the weight of the corresponding reference back down to 
this cell.  (And yes, there's also an array of downs, which likewise 
indirects through cells to adjust the weights of the cells below it 
that refer to the current cell.)
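To make the collision concrete, here is a minimal sketch of the structure described above (the names adjustUps, weight, and delta are hypothetical, as is using size_t indexes in place of the Idno type):

```d
import std.stdio;

// Hypothetical sketch of the cell network described above.
class Cell
{
    size_t[] ups;      // indexes into the enclosing `cells` array
    size_t[] downs;
    float weight = 0.0;
}

Cell[] cells;

// Processing cell `i` reaches *other* cells through `ups`, so two
// cells processed in parallel can both write the same up-cell's weight.
void adjustUps(size_t i, float delta)
{
    foreach (up; cells[i].ups)
        cells[up].weight += delta;   // potential write collision
}

void main()
{
    cells = [new Cell(), new Cell()];
    cells[1].ups = [0];              // cell 1 points up at cell 0
    adjustUps(1, 0.5f);
    writeln(cells[0].weight);
}
```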

This process offers several possibilities for collision when done in 
parallel, which is why I thought the alterations should probably be 
done by creating a new cell and substituting it for the original one. 
If this isn't necessary, that would be a real benefit.

Additionally, it's quite possible that a cell would be added to the 
cells array during activation, but your point about this being handled 
via ranges is a good answer to that worry.

Perhaps part of the reason I'm not understanding is that all the 
examples involve cases where there isn't this kind of mutual 
interaction.  But to me it looks as if my choices are making the cells 
immutable (doable, if unpleasant) or placing write-synchronization 
locks around every place where I alter a weight.  Except that TDPL 
indicates that if I do that, I also need to place the same locks around 
every place where a value might be read, which is much worse than just 
making the cells immutable.  (Making the cells immutable is bad enough, 
as there are lots more pieces than I've shown, and if they are to be 
immutable, the constructor has to set ALL of them in their final form.)
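For reference, the lock-everything alternative TDPL describes would look roughly like this per-cell sketch (a hypothetical illustration, not your actual Cell; note that the getter takes the lock too):

```d
import core.sync.mutex : Mutex;

// Sketch: a per-cell mutex guarding both writes *and* reads of the
// weight, per TDPL's advice that readers need the same lock as writers.
class LockedCell
{
    private float weight_ = 0.0;
    private Mutex mtx;

    this() { mtx = new Mutex; }

    void addWeight(float delta)
    {
        synchronized (mtx) { weight_ += delta; }
    }

    float weight()
    {
        synchronized (mtx) { return weight_; }  // reads lock too
    }
}
```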

P.S.:  I haven't checked, but does 
std.parallelism's ParallelForeach!(R) parallel(R)(R range) copy the 
specified range into the RAM space of the other thread?  If so, I may 
not be able to use it due to excessive RAM consumption.  If it just 
passes the range and gets the items from the originating thread as 
needed, that would be no problem.  Actually, reading the chapter you 
wrote, it appears that parallel() does all its work in the main thread 
and only passes the items to be processed to tasks as free threads in 
the thread pool become available, which is what I would want.  I wish 
I were a bit more certain.

But at this point the main question is "Is there a decent way to avoid 
the requirement of making cells immutable?"  Perhaps it would be 
something like:
void modifyWeight(shared Cell thisCell, shared Cell upCell, float weight)
{
    synchronized (thisCell, upCell)
    {
....
    }
}
Except that shared variables aren't at all what I want...or I don't 
think they are.  Certainly the cells array can't be shared, as that 
would prevent me from adding new cells to it.  But on the other hand, 
I sure don't want the cells array copied into every new thread.  That 
would grind the system to a halt in no time.

Am I just confusing myself by reading about different approaches to 
parallelism, or is this just something that can't really be handled? 
When a thread gets started, just what gets copied to it?  If the array 
is in a class instance, does only the reference to the class get 
copied, or does the entire instance get copied?  If the whole thing 
gets copied, then this is clearly not going to work, but I really 
can't be sure.
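As far as I understand it (my reading, not a guarantee): threads in D get fresh copies only of thread-local module-level and static variables; heap objects reached through a class reference live in memory visible to all threads, and std.parallelism hands out work items by reference rather than deep-copying the range. A small check of that assumption (Holder is a hypothetical name):

```d
import std.parallelism;

class Holder { int[] data; }

void main()
{
    auto h = new Holder;
    h.data = new int[](1000);

    // Pool workers see the same Holder instance through the
    // reference; nothing is deep-copied between threads.
    foreach (i, ref x; taskPool.parallel(h.data))
        x = cast(int) i;

    assert(h.data[999] == 999);  // the writes are visible back here
}
```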

All of the examples that I find concern themselves with trivial amounts 
of data, so it doesn't really matter if the data gets copied around, 
but that's not the case here.  If the data is going to get copied, then 
the entire threading approach needs to be scrapped.  Although....

Perhaps another approach is to start a thread and store the data in 
that thread.  Then to find the data, you query that thread; to save 
the data, you send it to that thread.  All data-file I/O would take 
place within that thread.  This would require a lot of planning that I 
haven't yet done, but it doesn't seem totally implausible.  I'd 
probably need to manage sending work to the threadpool manually rather 
than using the parallel method.  But I still need to ensure that there 
aren't write collisions, so it still looks as if I need to make the 
cells immutable.  And lots more data would be flowing around, but not 
as much at any one time, so the peak RAM requirement should be lower. 
(My rough guess is that twice as much data would be flowing, but the 
peak RAM requirement might be 0.01 of an approach that involved 
copying the entire array once.  I can't even take once per thread 
seriously as a possibility.)
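That owner-thread idea can be sketched with std.concurrency's message passing (all names here are hypothetical): the owner holds the weights and answers queries, so no other thread ever touches the array and no locks are needed.

```d
import std.concurrency;
import std.stdio;

// Hypothetical sketch: one thread owns the data; others query it
// by message instead of sharing the array.
void owner()
{
    float[] weights = [0.1f, 0.2f];
    bool done = false;
    while (!done)
    {
        receive(
            // Read request: send the weight back to the asker.
            (size_t idx, Tid requester) { requester.send(weights[idx]); },
            // Write request: adjust a weight in place -- safe, because
            // only this thread ever touches `weights`.
            (size_t idx, float delta) { weights[idx] += delta; },
            (string msg) { if (msg == "stop") done = true; }
        );
    }
}

void main()
{
    auto tid = spawn(&owner);

    tid.send(size_t(1), 0.3f);       // ask the owner to adjust a weight
    tid.send(size_t(1), thisTid);    // then read the value back
    auto w = receiveOnly!float();
    writeln(w);                      // the adjusted weight

    tid.send("stop");
}
```

This trades shared memory for message traffic, which matches the "twice as much data flowing, lower peak RAM" trade-off described above.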

OTOH...it does look as if on my current system (6 cores) the serial form 
would be faster.  But I don't expect to stay on my current system 
forever.  Intel has announced that tablets in a few years will have over 
40 cores.  So the design should be for running on lots of cores, even if 
that's not what I have right now.


More information about the Digitalmars-d-learn mailing list