asynchronous communication between threads

Charles Hixson charleshixsn at earthlink.net
Fri Jan 4 20:15:06 PST 2013


On 01/04/2013 02:56 AM, Dmitry Olshansky wrote:
> 04-Jan-2013 10:36, Charles Hixson wrote:
> [snip]
>>> So a cell is in fact a task (a more common name for it, I think) with a
>>> mailbox and you want threads multiplexed across these tasks. The task is
>>> running some code/function/callback/whatever that periodically polls a
>>> mailbox & puts stuff in other task's mailboxes. So far good?
>>>
>>> Then definitely take a look at Fiber in druntime (it's core.Fiber
>>> AFAIK).
>>
>> It's in core.Thread. A fiber would work well for the cell, but it's
>> been pointed out to me that with this design, an ordinary class instance
>> would also work, so for that it's probably overkill. Which leaves the
>> mailbox unhandled. "Periodic" polling, as in time based, is too
>> expensive. Periodic polling as in "once in each cycle" is what is
>> planned, and that part doesn't require the overhead of threads, etc.,
>> except to keep partial messages from being read. But the documentation
>> of fiber *strongly* implies that the contents of the mailbox would be
>> copied from its normal location to the active thread, if it's not the
>> same thread within which the cell executes. So it looks to me as if the
>> mailbox needs to be synchronized. But I'm not really sure what
>> synchronized means in D. (N.B.: The cell is logically a task, but if
>> it were implemented as such, then the mailbox would be unnecessary. But
>> the overhead would be unbearable.)
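
As far as I can tell (and this is just my own reading, sketched here for concreteness), a synchronized class in D simply takes the object's monitor around each public method, roughly as if one had written synchronized (this) blocks by hand.  A tiny illustration, with made-up names:

    class Counter
    {
        private int n;

        void increment()
        {
            synchronized (this)   // take this object's monitor
            {
                ++n;
            }
        }

        int value()
        {
            synchronized (this)
            {
                return n;
            }
        }
    }
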
>>>
>>>>
>>>> That skeleton of code was intended to show the idea of cells isolated
>>>> from outside attached to a mailbox, which has blocking access from the
>>>> outside. The cells should never block, but this is largely because they
>>>> are only directly accessed by the thread within which they run. The
>>>> mailboxes can receive messages from anyone, but their processes are so
>>>> short, that blocking will be extremely brief.
>>>
>>> I'm not so optimistic about locking. Even though it's brief, there
>>> are many threads that may be putting stuff simultaneously into mailboxes,
>>> thus contending on the lock and causing context switches.
>>> Plus, the lock/unlock of a mutex is not free. The lock-based message queue
>>> is nice for a start (fewer _subtle_ bugs) but you will surely have to look into
>>> lock-free alternatives later on.
>> The number of cells is huge when compared with the number of available
>> threads, on the rough order of 1,000,000 to 6. Because of this I don't
>> expect contention to be a big problem. OTOH, the number of
>> cores/processor is going *up*, so a lock-free system would be very
>> desirable. But the only lock-free design I'm familiar with is CSMA/CD
>> (ethernet). And to implement that, I think the mailboxes would need to
>> be shared. And this requires some protocol such as CSMA/CD for each
>> access. It's not clear to me that this would have less overhead than
>> would synchronized (which locks, if I understand correctly, the class
>> instance when someone else has write access to it).
>
> CSMA/CD has the advantage that both parties know what they send and
> can detect a collision before the end of the transfer(!). In software
> design, lock-free has analogous power, but on the scale of one memory location
> (one word). This implies first working on the side (creating the message
> etc.), then updating the state in one atomic op; if there was a conflict
> (somebody else put their word into that memory location), you redo the update step.
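
If I follow, the retry shape is something like this (a minimal sketch using core.atomic; highWater and raiseTo are made-up names, and the "work on the side" here is just computing the candidate value):

    import core.atomic : atomicLoad, cas;

    shared int highWater;   // a single word of shared state

    // Prepare the new value on the side, then publish it with one atomic
    // compare-and-swap; if somebody else updated the word meanwhile, redo.
    void raiseTo(int candidate)
    {
        int old;
        do
        {
            old = atomicLoad(highWater);     // snapshot the current state
            if (candidate <= old) return;    // nothing to publish
        }
        while (!cas(&highWater, old, candidate));
    }
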
>
>> Certainly TDPL
>> talks about them sequentially, and implies that there is a tight
>> connection between shared variables and synchronized classes. It is as
>> if a synchronized class is the analog of an atomic operation acting on a
>> larger set of data. And when I look at implementing CSMA/CD it looks as
>> if only one thread can detect that its write was interrupted, and then
>> only after it's complete. Unless I implement a checksum scheme, in
>> which case I guess both could detect an improperly completed write. But
>> then independent readers couldn't detect the problem. So readers would
>> need to be locked out while writing was in process...and we're back to
>> mutexes/spinlocks/etc.
>
>> So it looks to me as if a synchronized mailbox class eliminates a
>> multitude of problems at the lowest reasonable cost.
>
> Given the millions of cells, you are going to lock things that need not be
> locked most of the time. It's hardly a reasonable cost.
But is there any cost to keeping something locked that nobody's looking 
at?  There's clearly a RAM cost, of course, but any solution will have 
some RAM cost, and presumably the library implementation of synchronized 
will be at least as good as any that I could come up with.
>
>> If it works as I
>> hope it does. But if the mailbox ends up getting copied from thread
>> address space to thread address space, then the overhead would be much
>> higher than a naive estimate, and I can't guess how high. It *could*
>> scale so poorly that 90% of the computation consisted of mailboxes being
>> copied from thread to thread.
>
> I don't quite get your "attached to thread" (here and later). You mean
> as in std.concurrency? Well, it was designed that way, that's all. There
> is no reason why a data structure has to be "attached" to a thread.
> And you talk like it's not you who controls what gets copied and where :)
Sorry, I keep getting confused about threads accessing RAM.  This is the 
first time I've actually TRIED to write a parallel task.  You *did* 
point out that the heap was generally accessible, and most of the time I 
remember that.  And this implies that only the reference to the 
thread... is really attached to the thread (i.e., stored in thread local 
memory).
>
> And as for the fear of things being copied between threads: if anything, threads
> share a single address space (a fact of life). If you want multiple
> processes then yes, different address spaces, and some copying is bound to
> happen.
The copying is, indeed, bound to happen.  I'm trying to design this so 
that the copying will be minimized.  But I also get confused about what 
is local to the thread.  (I didn't realize until you told me a message 
or so ago that the heap was shared.)  I knew that it was in most 
languages, but apparently I got confused when TDPL said that in D 
threads did not share memory (13.3).
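
To pin down my understanding (just a sketch for my own benefit, not part of the design): module-level variables in D are thread-local unless marked shared or __gshared, which is what TDPL means by no default sharing, but a class instance on the GC heap is reachable from any thread that holds a reference to it.

    import core.thread : Thread;

    int perThread;            // thread-local: each thread gets its own copy
    __gshared int everybody;  // a single copy, visible to all threads

    class Cell { int state; }

    void main()
    {
        auto c = new Cell;    // lives on the shared GC heap
        auto t = new Thread({
            perThread = 1;    // writes this thread's own copy only
            c.state = 42;     // same object the main thread sees
        });
        t.start();
        t.join();
        assert(perThread == 0 && c.state == 42);
    }
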
>
>>>
>>>> I wanted to know if this proposed design would work, as in not getting
>>>> into deadlocks, not blocking excessively, and not giving me excessively
>>>> stale data.
>>>
>>> The crucial part is missing - taking a message out of the mailbox ;)
>> The only entity permitted to take a message from the mailbox is the cell
>> to which it is attached, and that happens at each initiation of
>> activation. But that's one reason that the mailbox should be
>> synchronized.
>
> Then here is the thing - if you're unlucky enough, a thread that sends will get
> blocked by the thread that reads messages. It's not a deadlock, but it is a waste
> of time.
This "waste of time" appears to me to be unavoidable.  Basically the 
cell takes ownership of the message vector, and a new one is created for 
the mailbox.  It's that, or timestamp everything, in which case I could 
use a queue.  In that case, though, I'd need to copy the messages out as 
I removed them, which would also take time, though in the "home" thread 
(i.e., the one holding a reference to the cell) rather than in the one 
that would be blocked in the first instance.
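
Roughly what I mean, as an untested sketch (Message and the member names are placeholders; I'm using an explicit Mutex from core.sync.mutex rather than a synchronized class just to make the locking visible):

    import core.sync.mutex : Mutex;

    struct Message { size_t sender; int action; double value; }

    class Mailbox
    {
        private Mutex guard;
        private Message[] msgs;

        this() { guard = new Mutex; }

        // Called by any sending cell, from any thread.
        void put(Message m)
        {
            guard.lock();
            scope (exit) guard.unlock();
            msgs ~= m;
        }

        // Called only by the owning cell at the start of its activation:
        // take ownership of the current vector and leave a fresh one behind.
        Message[] takeAll()
        {
            guard.lock();
            scope (exit) guard.unlock();
            auto taken = msgs;
            msgs = null;
            return taken;
        }
    }

That way the lock is held only for the swap (or for one append), not while the cell works through its mail.
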
>
>>>
>>> But anyway let's focus on the details. 2 classes and 2 functions. Cell's
>>> send & MBox's receive.
>>>
>>> Let's suppose we have 2 Cells A & B and their mboxes MA & MB.
>>> From the code I see (that's full of typos? MCell & Cell are used
>>> interchangeably) the chain of events for a full send from A --> B is:
>>>
>>> 1. A Cell's send locks cell A. (send is sync-ed)
>> Why does it lock the cell? But perhaps this is because the proffered
>> cell class was synchronized.
> Yup.
OK.  Since that cell is referenced in only one thread, it doesn't need 
to be synchronized.
>
>>> 2. It locks target cell B.
>> Why does it lock cell B? Cell A makes no access to cell B. Only to the
>> mailbox. Which is why in that rough design cell and mailbox were
>> separate (synchronized) classes at the same level. If it locks Cell B,
>> then the design won't work. (OTOH, the Cell function doesn't need to be
>> synchronized anyway, that's a remnant from a prior design.)
>>> 3. It then locks its mailbox MB.
>> Check.
>>> 4. undoes all the locks backwards.
>> Check.
>>>
>>> Then there is of course a deadlock bound to happen if B is sending
>>> message in opposite direction, e.g. :
>>> 1. A locks A (making its first step)
>>> 2. B lock B (ditto)
>>> 3. A locks B & blocks
>>> 3. B locks A & blocks
>> If the cells get locked, then the design will not work. No argument. I
>> was attempting to avoid that by making the mailbox a separate class, and
>> having each cell only contact the other cell's mailbox. If that doesn't
>> work, then the design is not going to work, and I'll need to create a
>> new one.
>>>
>>> that is, if for instance there is a step 2. So I guess you didn't mean the
>>> step 2.
>>>
>>> If there is no lock of the cell except before sending then it looks
>>> legit as there are 2 locks protecting separate entities:
>>>
>>> - one is to protect message queue on putting message into it
>>>
>>> - the other one is to protect ... what exactly? send is already
>>> implicitly guarded by target queue's lock.
>> Yeah. Sorry, that was sloppy thinking on my part. The cell doesn't
>> need to be synchronized. (But I don't understand why that would be
>> destructive of anything except efficiency.)
>
> I find extra locking to also be confusing as it looks like nobody knows
> where the synchronization was required in fact.
Sorry.  It is, indeed, evidence of confused thinking.  The Cell class 
doesn't need to be locked.
>
>>>
>>> So I'd say you only need to guard the message queue and that's about it.
>>> The only other concern is properly scheduling the execution of tasks (or
>>> your cells) so that one runs on no more than one thread at any given
>>> time.
>> The idea here is that each cell has an id#, and there is a pool of
>> threads. Each cell is accessible by the thread whose sequence number is
>> the same as id# mod # of threads. Each thread loops through the cells
>> accessible to it. When a cell is activated by the thread, first it
>> checks its mailbox to update its state. Then it processes, possibly
>> sending messages to the mailboxes of other cells in a task independent
>> way. (It won't know about threads, only about mailboxes.) Execution
>> then passes on to the next cell in the thread.
>
> So the code to process messages is the same for each cell? What about
> connections to other cells (as in, which ones it sees) - are they permanent?
The connections are variable, and neither predeterminable nor immutable. 
They vary depending on the incoming data.  Also, the number of cells is 
variable depending on the incoming data.
>
>>>
>>> In simplest case just make a locked (or lock-free) queue of these and
>>> let threads pick cells/tasks from it and put back when they are done.
>> A lock-free queue would do it, I think, but the only ones I know of are
>> attached to the thread rather than to a data instance, and that WOULD
>> lead to massive contention. And increasing contention as the number of
>> cores (threads) increased.
>
> Yup, separating by ID (a round robin sort of thing) could be far better.
> That is, if the workload on cells is trivially balanced by partitioning
> cells beforehand (I bet not).
It might be.  I'll need to think about that.  I expect the load per cell to 
be essentially random WRT id#, but I could set things up so the next 
cell is added to the next available thread.  It's more complex, but not 
hugely so.  (For the other meaning of round robin, adding cells by id# 
mod thread count would be a round robin.)
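
The per-thread loop I have in mind would then be something like this (an illustrative sketch reusing the Mailbox/Message sketch above; Cell.process stands in for whatever the cell actually does on activation):

    class Cell
    {
        Mailbox mailbox;
        void process(Message[] mail) { /* update state, put() into other mailboxes */ }
    }

    // Each worker thread owns the cells whose id# mod nThreads equals its
    // own index, and makes one pass over them per cycle.
    void onePass(Cell[] allCells, size_t threadIndex, size_t nThreads)
    {
        for (size_t id = threadIndex; id < allCells.length; id += nThreads)
        {
            auto cell = allCells[id];
            auto mail = cell.mailbox.takeAll();   // read the mailbox first
            cell.process(mail);                   // then act, possibly sending
        }
    }
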
>
>>>
>>> Far better is a large bounded queue that you never ever remove/put stuff
>>> into. It's just a big array of pointers/references to tasks. A thread
>>> then just goes around it looking for valid/ready entries (they stay
>>> allocated, so no nulls there) and executes them. That goes firmly into
>>> lock-free territory to make it correctly synchronized, but with a bit
>>> of care it should be doable.
>> It would be quite reasonable to make the queue bounded, and reject
>> messages when the mailbox was full. But, again, if it's attached to the
>> thread rather than to the object there is going to be a massive problem
>> with contention.
>
> Again I don't get the attached to thread thing. TLS? Just don't put it
> in TLS then ;)
Perhaps the problem is that I don't know how to implement a lock-free 
queue that's thread safe?  I looked at the version in (IIRC) 
std.concurrency because that's the one I found in the library.  Queues 
I can do, but when lock-free and thread-safe are added, I'm well beyond 
my depth.
>
>>>
>>> The second one also can be done with locks. In this case a thread goes
>>> through all of the tasks/cells and tries to lock them (That's where your
>>> lock around it comes in, isn't it?). If it locks - cool, work on it, if not
>>> - try the next one.
>> Good point. If the cell is initiated and can't read its mailbox (this
>> action needs a write lock, as it removes messages), then going on to the
>> next cell rather than blocking is better. (Again, if the mailbox is
>> attached to the thread, this won't work. All mailboxes will be nearly
>> constantly in contention.)
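
So the mailbox could grow a non-blocking variant of takeAll, something like this (a sketch of a method that could be added to the Mailbox class sketched earlier, using Mutex.tryLock):

    // Gives up instead of waiting when a sender currently holds the lock;
    // the caller just moves on and retries this cell on the next pass.
    bool tryTakeAll(out Message[] taken)
    {
        if (!guard.tryLock())
            return false;
        scope (exit) guard.unlock();
        taken = msgs;
        msgs = null;
        return true;
    }
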
>>>
>>>
>>> > More details aren't available, because I didn't want to
>>>> commit to this design if the basic design was wrong, so I haven't
>>>> written them. It has been suggested that since the cell will only be
>>>> accessed by the thread, it doesn't need to be synchronized.
>>>>
>>>
>>>> I'm really nervous about how many copies of the cell will exist,
>>>> however. Since there are going to be so many of them, if I ended up
>>>> with a cell/thread, the system would bog down in unused storage. But
>>>> the mailbox needs to be globally accessible for the scheme to work.
>>>
>>> Everything on the heap is accessible from other threads, provided they
>>> have the pointer to the object in question.
>> Good. And since I'm working with classes, everything is really on the
>> heap, which means that only pointers will get copied. (Well,
>> references. I avoid using pointers in my code unless I really can't
>> avoid them.)
>>>
>>>
>>>> N.B.: When a cell receives a message from another cell it's likely, but
>>>> not guaranteed, to send a response back. It may also send responses
>>>> onwards. And the number of cells isn't fixed, nor is the number of
>>>> their connections. (This is less important in D than in many other
>>>> languages, but even in D it affects serialization.)
>>>>
>>>> FWIW, I've been told that an approximately similar design has worked in
>>>> Ada, though the design was written in Ada-83. (In Ada the mailbox was
>>>> protected, which I take to be approximately the same as synchronized.)
>>>
>>> In general there are ways to make it fly. The tricks to use depend on
>>> the use case and what is bottleneck (I/O or the CPU time).
>>> The pain points are faster mailboxes and better scheduling (as in fewer
>>> context switches for nothing, faster turn-around time, etc.).
>>>
>> The bottleneck will be CPU time (given sufficient RAM). No way to avoid
>> that. Stuffing things into a mailbox is going to be basically copying a
>> struct. (That hasn't been designed yet, but it's going to include a ref
>> to the sender, a requested action, a numeric value, and I'm not sure
>> of what else. The "requested action" will probably be represented as an
>> enum. I'm probably going to avoid strings, as they don't appear to be
>> valuable, even though that's just a reference copy. So say 128 bytes of
>> parameters, or possibly 256.)
>
> That's a lot no matter how you put it: 1 message per mailbox in 1M
> cells is 128/256 megs of RAM to boot. Obviously even a number as modest as
> about 8 messages each is hitting gigabytes with these big kinds of messages.
>
> I bet the easiest way out is variable-length messages (so it doesn't
> have to be all 128/256 bytes). Then see the point about queue below.
>
>> And receiving a message is copying the
>> parameters into a queue.
>> Perhaps I could remove the synchronization
>> from the class, and just guard calculating the index position into the
>> queue, as once the position in the queue was known, there wouldn't be
>> any contention WRT where the message would be stored.
>
> Just keep in mind cache lines are ~ 64 bytes, thus messages would have to
> be aligned to a multiple of 64. That's where having a preallocated bounded
> queue of _pointers_ to messages could be nice as you never write to the
> queue itself.
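
If I read that right, guarding just the index could reduce to a single atomic increment per send, something like this sketch (the names are made up, Message is redeclared with align(64) purely for illustration, and a real version would still have to publish the filled count to the reading cell safely and reset the counter each cycle):

    import core.atomic : atomicOp;

    align(64) struct Message { size_t sender; int action; double value; }

    struct BoundedBox
    {
        enum capacity = 20;
        Message*[capacity] slots;     // preallocated pointer slots
        shared size_t next;           // index of the next free slot

        // Reserve a slot with one atomic increment; only that one slot is
        // then written, so senders contend only on the counter itself.
        bool put(Message* m)
        {
            auto i = atomicOp!"+="(next, 1) - 1;
            if (i >= capacity)
                return false;         // mailbox full: reject the message
            slots[i] = m;
            return true;
        }
    }
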
>
>> That should be a
>> very fast design, but I worry about how much RAM space would be
>> required.
>
> Yup, but the RAM will eat your design anyway if messages are that big
> and the number of cells is in the millions. The only sensible way to scale this
> up is having a cluster of cheap machines, each running a bunch of cells
> and communicating over a fast network (which is still very slow compared to
> memory access).
Yeah, but I'm not a company.  I've got what I've got (or what I replace 
it with...but it would be a replacement, not an addition.  Space is also a 
constraint.)
>
>> With a pre-allocated queue, each mailbox would consume the
>> maximal amount of space even if none were used. So if there were space
>> for 20 messages, then the mailbox would consume 5120 bytes + overhead
>> for each cell. Which means that if I have 500,000 cells (an extremely
>> lowball figure) just the mailboxes would consume 2,560,000,000 bytes
>> plus overhead.
>> True, most of that would never be accessed, but enough
>> accesses would be randomly distributed throughout the system that I
>> would expect thrashing...or even failure due to inability to allocate
>> memory.
>
>> This would be compressed significantly if I used a thread
>> attached mailbox, at the cost of nearly guaranteed massive contention
>> problems.
>
>> And 20 messages is an unreasonably low upper limit, even
>> though it's too high for a mean value. (I expect the mean value to be
>> closer to 3-4.) So I'd been planning on using variable length arrays
>> for the mailbox, which would be deallocated every time the cell accessed
>> them. So messages could be attached by simply doing an append.
>
> Append = (re)allocation, that is, either taking the OS (or GC) lock to
> allocate, or the mailbox has an allocated memory chunk bigger than required
> anyway.
Well, if the mailbox is initially allocated to 4 messages, then if 
there's overflow it jumps to 8, IIUC.  Then to 16.  Etc.  So automatic 
allocations will only happen occasionally, and for most cells will never 
happen (unless I've misjudged the average number of messages).
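
And if the per-cycle reallocation does turn out to matter, one option (a sketch, assuming the cell consumes the messages in place rather than taking the array away) is to keep the buffer and just reset it, so it only grows when a cell really does get more mail than ever before:

    // length = 0 keeps the allocated capacity; assumeSafeAppend tells the
    // runtime it's safe to append over the old contents again.
    void recycle(ref Message[] msgs)
    {
        msgs.length = 0;
        msgs.assumeSafeAppend();
    }
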
>
>> This
>> would place the message queue on the heap, with an initially quite low
>> value for maximal # of messages. I'll admit that this may only look
>> better because I can't estimate the amount of RAM consumed.
>
> It may help, after the system warms up, to figure out sizes for mailboxes.
> However they may tend to oscillate, and keep in mind the extra space
> reserved by dynamic allocations (see above).
Yeah.  64 bytes is larger than I was guessing.
>
>> Perhaps I need to use some sort of disk cache of a data file, and only
>> have the most active cells in memory at any particular time.
>
> Forget it - let your OS memory manager eat its turf with extra swap
> space. That being said it will crawl to a stop once swapping is
> relatively high.
>
>> This would
>> still result in a lot of thrashing, but in a more controlled manner.
I've got only around 8 GB of actual memory, and it looks like 7.8 GB total 
>> memory, if one includes virtual memory.
>
> You mean available? With some cleverness virtual memory can be overtaxed
> far beyond real RAM, BTW (if you never end up using more than RAM+swap of
> it).
Well, I didn't mean available.  But if I decide to use virtual memory, I 
can repartition the hard disk to increase that dramatically.  Probably 
buying a new machine would be better than adding RAM, though. My current 
machine is rather old.  But I don't want to do that until I have a 
working program to use on it.  And I can't yet buy enough machine for 
what I can afford, but then the program isn't ready yet either.
>
>> Perhaps this needs to be
>> upgraded...which would probably mean I upgraded the number of cores
>> available, meaning increased contention.
>
> Given the numbers, that's the least of your worries. Unnecessary locking is.
>
>> But I could clearly upgrade
>> the amount of virtual memory with just a repartitioning of the disk.
>> It's not been needed so far, but then I've just been designing this
>> system, not implementing it. OTOH, virtual RAM is thrashing, so it's
>> not clear how much that would help over, say, a BTree that rolled out
>> relatively inactive cells, even though each cell would need to be RAM
>> resident at least once per cycle.
>>
>> That said, the application I'm designing would probably overstress any
>> computer I could afford. I'm almost resigned to that. I just want to
>> come as close to reasonable speed of execution as possible, and I
>> clearly want to avoid deadlocks and data that doesn't get properly
>> updated.
>>
>
> Then just run simulations as big as you can and optimize speed, then
> keep that size fixed and optimize RAM use, then start over again.
> The RAM vs speed should be more or less analyzable (and extrapolable)
> even if tested on a select bunch of modest PCs.
"select bunch of modest PCs" is a problem.  I've got a few old 
computers, but no space to use them in.  (Seriously, I should really get 
rid of them, but one of them is occasionally useful when the active 
machine is in the shop.)  Space is a lot more expensive than computers.
FWIW, this program I'm describing *IS* a simulation of a more complex 
system, drastically simplified already.  I've got less trouble 
sacrificing speed than capabilities.  But I could reduce the size 
requirements by using a restricted range of inputs.
>
>> OTOH, rolling things out to a B+Tree means that I need to devise a way
>> to access the mailbox based around the target's id# rather than around a
>> reference to the item. A hash table is the obvious solution, but the
>> class managing the hash table would need to roll the cell+mailbox in if
>> it isn't RAM resident. Not something that's reasonable to do while the
>> mailbox access is locked. So the mailbox would need to queue the
>> request for access to the cell's mailbox with a thread, stuff the
>> message in a "to be acted upon" queue, and return. And where that queue
>> should be stored isn't clear to me.
>
> That's why I suggest to not even begin thinking about it - virtual ram
> and OS does it far better (and with hardware assistance from CPU).
>
> Forget the hash table if RAM is at a premium; use straight arrays, they have
> the least per-item overhead.
Yes.  A hash table, while obvious, is clearly not the right idea.  I do 
find myself attracted to a variation on a B+Tree with fancy rebalancing, 
so that each node is more than 2/3 full.  (I think I could get it to 
average over 3/4 full, excluding the root node.) This, however, is 
computationally expensive.  Its main advantages are that then the limit 
to the structure is disk space, and persistence is automatically 
maintained.  It would, however, require another redesign.

OTOH, *some* means of saving state persistently will be necessary, as I 
can't dedicate my computer to nothing but this program.  Also deleting 
cells that have become very inactive may prove to be difficult, and 
might be better done while the rest of the program is inactive.  (That 
would mean marking the cell for deletion during normal operation, and 
then running the garbage collector separately.  Facilitating this is one 
reason that all links need to be bidirectional.)
>
>> Perhaps that's where the thread
>> attached non-blocking queue should come into play. Also, hash tables
>> have enough overhead themselves that they would limit the number of RAM
>> resident cells considerably over prior estimates, even while increasing
>> the total number that could be dealt with. Probably they would halve
>> the number of RAM resident cells (a rough estimate, admittedly), while
>> expanding the total number of cells that could be handled to be limited
>> by available disk space. They would also impose a severe performance
>> penalty. (In prior small scale tests, hash table access has been very
>> roughly about 1/10 as fast as variable length array access. Of course,
>> this is still a lot faster than disk file access.)
>>
>> Still, it's sounding like the basic design will work, unless a cell
>> calling the mailbox of another cell locks that cell, and I don't see any
>> reason why it should...but I've been repeatedly surprised by the
>> requirements imposed on concurrently executing threads of execution.
>
> There is no deadlock as discussed (plus, not a single call tries to grab 2
> locks at the same time). However, if there is a single lock around the
> mailbox (not 2 separate ones for put/get) you'd need to use the "try
> lock" trick mentioned to prevent the awkward blocking of one thread when
> reading & writing collide on the same mailbox.
That sounds like good advice.
>
> Also if adventurous enough to look at lock-free I suggest to read this
> piece by Herb Sutter:
> http://www.drdobbs.com/parallel/writing-a-generalized-concurrent-queue/211601363?pgno=2
>
From that link (right near the start):
     "Like any shared variable, the next pointer needs to be
      protected by a mutex or, as here, declared as an ordered
      atomic type (C++0x atomic<> or Java/.NET volatile)."
To me that sounds like synchronized, since it can't be atomic, being a 
struct.  My C++ is pretty weak, though.  I'm guessing at what is meant 
by "ordered atomic".  In D, however, I believe that 64 bits is the 
length of the longest atomic type.  This is a bit of a guess, but D's real 
type can't be atomic on Intel CPUs, because it is 80 bits long.

OTOH, the last bit of TDPL (pg 426-429 of my copy) talks about lock-free 
stacks and lists.  For the mailbox I think a stack would do as well as a 
queue as it's going to be emptied entirely when the cell accesses it, 
though perhaps the list would be better, but the stack is simpler.  I'd 
need to adapt the pop routine into a "popAll", but that looks 
straightforward.
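
Something like the following is what I have in mind, with heavy hedging: it's an untested adaptation sketch (not the TDPL code itself), Message is as in my earlier sketch, and the head is kept as a plain Node* just to keep the types simple; a stricter version would qualify it shared and read it with atomicLoad.

    import core.atomic : cas;

    // One node per message; the whole mailbox is just the head pointer of
    // an intrusive stack.
    struct Node
    {
        Message payload;
        Node*   next;
    }

    // Senders: build the node on the side, then publish it with one CAS;
    // retry if another sender got in first.
    void put(ref Node* head, Message m)
    {
        auto n = new Node(m, null);
        do
        {
            n.next = head;                    // snapshot; the cas validates it
        }
        while (!cas(&head, n.next, n));
    }

    // The owning cell: one successful CAS of the head to null captures the
    // whole list, which can then be walked with no further synchronization.
    Node* popAll(ref Node* head)
    {
        Node* captured;
        do
        {
            captured = head;
            if (captured is null)
                return null;                  // nothing to take
        }
        while (!cas(&head, captured, cast(Node*) null));
        return captured;
    }
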
>
> It's quite short and it neatly deals with a lot of the problems of making a
> concurrent queue in a portable way. (Note that you have a
> multi-producer, single-consumer queue.)
>


