asynchronous communication between threads

Dmitry Olshansky dmitry.olsh at gmail.com
Mon Jan 7 11:47:24 PST 2013


05-Jan-2013 08:15, Charles Hixson wrote:
> On 01/04/2013 02:56 AM, Dmitry Olshansky wrote:
>> 04-Jan-2013 10:36, Charles Hixson wrote:
>> [snip]
>>> Certainly TDPL
>>> talks about them sequentially, and implies that there is a tight
>>> connection between shared variables and synchronized classes. It is as
>>> if a synchronized class is the analog of an atomic operation acting on a
>>> larger set of data. And when I look at implementing CSMACD it looks as
>>> if only one thread can detect that its write was interrupted, and then
>>> only after it's complete. Unless I implement a checksum scheme, in
>>> which case I guess both could detect an improperly completed write. But
>>> then independent readers couldn't detect the problem. So readers would
>>> need to be locked out while writing was in process...and we're back to
>>> mutexes/spinlocks/etc.
>>
>>> So it looks to me as if a synchronized mailbox class eliminates a
>>> multitude of problems at the lowest reasonable cost.
>>
>> Given the millions of cells, you are going to lock things that need not
>> be locked most of the time. It's hardly a reasonable cost.
> But is there any cost to keeping something locked that nobody's looking
> at?  There's clearly a RAM cost, of course, but any solution will have
> some RAM cost, and presumably the library implementation of synchronized
> will be at least as good as any that I could come up with.

There is a cost to locking and unlocking a mutex. You can write a simple
program and measure it, or just look around for some hard data on the
subject.
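For example, something as simple as this gives a ballpark figure for the
uncontended case (the timing code and the loop count are just illustrative;
under contention the cost is much higher):

    import core.sync.mutex : Mutex;
    import core.time : MonoTime;
    import std.stdio : writefln;

    void main()
    {
        enum n = 10_000_000;
        auto m = new Mutex;
        auto start = MonoTime.currTime;
        foreach (i; 0 .. n)
        {
            m.lock();     // uncontended - contention costs far more
            m.unlock();
        }
        auto elapsed = MonoTime.currTime - start;
        writefln("%s lock/unlock pairs in %s ms", n, elapsed.total!"msecs");
    }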

>> And as for the fear of things being copied between threads: if anything,
>> threads share a single address space (a fact of life). If you want
>> multiple processes then yes, separate address spaces, and some copying is
>> bound to happen.
> The copying is, indeed, bound to happen.  I'm trying to design this so
> that the copying will be minimized.  But I also get confused about what
> is local to the thread.  (I didn't realize until you told me a message
> or so ago that the heap was shared.)  I knew that it was in most
> languages, but apparently I got confused when TDPL said that in D
> threads did not share memory (13.3).

The memory (at the hardware level) is shared either way, but semantically
(at the type-system level) everything not tagged shared is thread-local.

The stack is already private to each thread, globals are stored in TLS
provided by the OS, and things are set up so that heap memory is also
'type-marked' as thread-local, though in fact it is only a type cast away
from being used by some other thread. The idea was to put more onus on
what data is passed between threads, to reduce threading errors and undue
sharing of data.
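A tiny example of what that means in practice (variable names made up):

    import std.stdio : writeln;
    import core.thread : Thread;
    import core.atomic : atomicOp, atomicLoad;

    int tlsCounter;            // module-level, but one copy per thread (TLS)
    shared int sharedCounter;  // one copy, visible to every thread

    void main()
    {
        tlsCounter = 42;
        auto t = new Thread({
            // The new thread sees its own, freshly initialized TLS copy.
            writeln("in thread: tlsCounter = ", tlsCounter);    // prints 0
            atomicOp!"+="(sharedCounter, 1);                     // shared access
        });
        t.start();
        t.join();
        writeln("in main:   tlsCounter = ", tlsCounter);        // still 42
        writeln("sharedCounter = ", atomicLoad(sharedCounter)); // 1
    }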

>>>>
>>>>> I wanted to know if this proposed design would work, as in not getting
>>>>> into deadlocks, not blocking excessively, and not giving me
>>>>> excessively
>>>>> stale data.
>>>>
>>>> The crucial part is missing - taking a message out of the
>>>> mailbox ;)
>>> The only entity permitted to take a message from the mailbox is the cell
>>> to which it is attached, and that happens at each initiation of
>>> activation. But that's one reason that the mailbox should be
>>> synchronized.
>>
>> Then here is the thing - if unlucky enough, a thread that sends will get
>> blocked by the thread that reads messages. It's not a deadlock, but a
>> waste of time.
> This "waste of time" appears to me to be unavoidable.  Basically the
> cell takes ownership of the message vector, and a new one is created for
> the mailbox.

I meant that the thread should keep working on other cells rather than be
blocked by this unfortunate collision, i.e. the "try lock" idea.
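Roughly like this - Cell and its fields are hypothetical, just to show the
shape of the loop:

    import core.sync.mutex : Mutex;

    // Hypothetical cell: each one carries its own lock and mailbox.
    class Cell
    {
        Mutex mtx;
        string[] mailbox;
        this() { mtx = new Mutex; }
        void process() { /* drain mailbox, update state, send messages */ }
    }

    // One pass of a worker thread over the cells it owns.
    void workerPass(Cell[] myCells)
    {
        foreach (cell; myCells)
        {
            // If another thread is busy delivering to (or working on) this
            // cell, don't block - skip it and come back on the next pass.
            if (!cell.mtx.tryLock())
                continue;
            scope (exit) cell.mtx.unlock();
            cell.process();
        }
    }

    void main()
    {
        workerPass([new Cell, new Cell, new Cell]);
    }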

>>>>
>>>> In simplest case just make a locked (or lock-free) queue of these and
>>>> let threads pick cells/tasks from it and put back when they are done.
>>> A lock-free queue would do it, I think, but the only ones I know of are
>>> attached to the thread rather than to a data instance, and that WOULD
>>> lead to massive contention. And increasing contention as the number of
>>> cores (threads) increased.
>>
>> Yup, separating by ID (a round-robin sort of thing) could be far better.
>> That is, if the workload on cells is trivially balanced by partitioning
>> cells beforehand (I bet not).
> It might be.  I'll need to think about that.  I expect the load/cell to
> be essentially random WRT id#, but I could set things up so the next
> cell is added to the next available thread.  It's more complex, but not
> hugely so.  (For the other meaning of round robin, adding cells by id#
> mod thread count would be a round robin.)

Yeah, I was thinking of a basic mod (n_threads + 1).
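Something like this (purely illustrative):

    import std.stdio : writeln;

    // Assign each cell to a worker thread by id; every thread then owns
    // a fixed, disjoint set of cells.
    size_t ownerOf(size_t cellId, size_t nThreads)
    {
        return cellId % nThreads;
    }

    void main()
    {
        enum nThreads = 4;
        auto buckets = new size_t[][](nThreads);
        foreach (id; 0 .. 10)
            buckets[ownerOf(id, nThreads)] ~= id;
        foreach (t, cells; buckets)
            writeln("thread ", t, " owns cells ", cells);
    }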

[snip]
>>> It would be quite reasonable to make the queue bounded, and reject
>>> messages when the mailbox was full. But, again, if it's attached to the
>>> thread rather than to the object there is going to be a massive problem
>>> with contention.
>>
>> Again, I don't get the "attached to the thread" thing. TLS? Just don't
>> put it in TLS then ;)
> Perhaps the problem is that I don't know how to implement a lock-free
> queue that's thread safe?  I looked at the version in (IIRC)
> std.concurrency because that's the one I found in the library.  Queues
> I can do, but when lock-free and thread-safe are added, I'm well beyond
> my depth.

The queue in std.concurrency, AFAIK, is not lock-free; there were floating
thoughts of making it such. Either way, it's a better idea to copy some
well-analyzed work than to try to roll your own lock-free stuff :)

[snip]
>>> That should be a
>>> very fast design, but I worry about how much RAM space would be
>>> required.
>>
>> Yup, but the RAM will eat your design anyway if messages are that big
>> and the number of cells is in the millions. The only sensible way to
>> scale this up is a cluster of cheap machines, each running a bunch of
>> cells and communicating over a fast network (which is still very slow
>> compared to memory access).
> Yeah, but I'm not a company.  I've got what I've got (or what I replace
> it with...but it would be replace, not add to.  Space is also a
> constraint.)

The other way to get memory usage and queue growth under control is the
following, that is if you can fit it into your design, and that's a big if.
The idea is:
1) Queues are intentionally quite small.
2) When a cell tries to put a message into the full queue of another cell,
it saves this 'suspended' state (right before the send; *this* could be
hard) and keeps the message waiting (1 pending message per cell max).
3) The thread switches over to the cell with the full queue (this implies
threads need to lock cells or otherwise avoid 2 threads working on one
cell).
4) When (some) thread later reaches the suspended cell, it first tries to
continue work starting with that pending send.

It may or may not fit into your design, but it does 2 things: it keeps data
from going stale (short queues, plus "busy" cells get their queues flushed
more often) and it limits the memory used. The overhead is obviously a bit
of CPU time to save/restore the state of 'interrupted' cells. A rough
sketch of the bounded-mailbox part is below.
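The Message type, the capacity, and the method names here are all made up,
and the actual save/restore of a cell's computation state is left out since
that depends entirely on your design:

    // Hypothetical message type and capacity, purely illustrative.
    struct Message { size_t from; double value; }
    enum mailboxCapacity = 8;

    class Cell
    {
        Message[] mailbox;       // intentionally small incoming queue
        Message pending;         // at most one outgoing message kept waiting
        bool hasPending;
        size_t pendingTarget;    // id of the cell the pending send targets

        // Try to deliver; if the target's mailbox is full, remember the
        // message instead of blocking or letting the queue grow.
        // Only the thread that owns *this* cell touches the pending fields.
        bool trySend(Cell target, size_t targetId, Message m)
        {
            synchronized (target)   // lock the receiving cell's monitor
            {
                if (target.mailbox.length >= mailboxCapacity)
                {
                    pending = m;
                    hasPending = true;
                    pendingTarget = targetId;
                    return false;   // suspended: retry on a later pass
                }
                target.mailbox ~= m;
                return true;
            }
        }

        // Called first when a worker thread picks this cell up again.
        bool retryPending(Cell[] cells)
        {
            if (!hasPending)
                return true;
            if (trySend(cells[pendingTarget], pendingTarget, pending))
            {
                hasPending = false;
                return true;
            }
            return false;           // target still full, stay suspended
        }
    }

    void main()
    {
        auto cells = [new Cell, new Cell];
        cells[0].trySend(cells[1], 1, Message(0, 3.14));
    }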


>>> That said, the application I'm designing would probably overstress any
>>> computer I could afford. I'm almost resigned to that. I just want to
>>> come as close to reasonable speed of execution as possible, and I
>>> clearly want to avoid deadlocks and data that doesn't get properly
>>> updated.
>>>
>>
>> Then just run simulations as big as you can and optimize speed, then
>> keep that size fixed and optimize the RAM used, then start over again.
>> The RAM vs. speed trade-off should be more or less analyzable (and
>> extrapolatable) even if tested on a select bunch of modest PCs.
> "select bunch of modest PCs" is a problem.  I've got a few old
> computers, but no space to use them in.  (Seriously, I should really get
> rid of them, but one of them is occasionally useful when the active
> machine is in the shop.)  Space is a lot more expensive than computers.

Aye, though laptops are pretty small and darn fast these days :)

> FWIW, this program I'm describing *IS* a simulation of a more complex
> system, drastically simplified already.

I deduced as much.

>  I've got less trouble
> sacrificing speed than capabilities.  But I could reduce the size
> requirements by using a restricted range of inputs.
[snip]
>> That's why I suggest not even beginning to think about it - virtual RAM
>> and the OS do it far better (and with hardware assistance from the CPU).
>>
>> Forget the hash table if RAM is at a premium; use straight arrays, they
>> have the least per-item overhead.
> Yes.  A hash table, while obvious, is clearly not the right idea.  I do
> find myself attracted to a variation on a B+Tree with fancy rebalancing,
> so that each node is more than 2/3 full.  (I think I could get it to
> average over 3/4 full, excluding the root node.) This, however, is
> computationally expensive.  Its main advantages are that the limit
> to the structure is disk space, and persistence is automatically
> maintained.  It would, however, require another redesign.
>
> OTOH, *some* means of saving state persistently will be necessary, as I
> can't dedicate my computer to nothing but this program.  Also deleting
> cells that have become very inactive may prove to be difficult, and
> might be better done while the rest of the program is inactive.  (That
> would mean marking the cell for deletion during normal operation, and
> then running the garbage collector separately.  Facilitating this is one
> reason that all links need to be bidirectional.)

Then also take a look at memory-mapped I/O (std.mmfile). Though persistence
would mean fixing up all the pointer/reference stuff...
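A minimal sketch with std.mmfile (the file name and size are made up, and
it only helps if the data is position-independent, i.e. indices rather
than pointers):

    import std.mmfile : MmFile;
    import std.stdio : writeln;

    void main()
    {
        // Map a newly created file of the given size as read-write memory.
        auto mm = new MmFile("cells.dat", MmFile.Mode.readWriteNew,
                             1024 * 1024, null);
        auto bytes = cast(ubyte[]) mm[];
        bytes[0] = 42;              // persists once the mapping is flushed
        writeln("first byte: ", bytes[0]);
    }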

>> Also if adventurous enough to look at lock-free I suggest to read this
>> piece by Herb Sutter:
>> http://www.drdobbs.com/parallel/writing-a-generalized-concurrent-queue/211601363?pgno=2
>>
>>
>  From that link (right near the start):
>      "Like any shared variable, the next pointer needs to be
>       protected by a mutex or, as here, declared as an ordered
>       atomic type (C++0x atomic<> or Java/.NET volatile)."
> To me that sounds like synchronized, since it can't be atomic, being a
> struct.  My C++ is pretty weak, though.  I'm guessing at what is meant
> by "ordered atomic".

Structs can be atomic if their size permits. Anyway, in this case it's a
pointer in disguise and thus can be atomic. "Ordered atomic" implies
sequential consistency (w.r.t. multiple parallel operations on the same
memory location) on top of atomicity.

> In D, however, I believe that 64 bits is the
> length of the longest atomic type.  This is a bit of a guess, but real
> types can't be atomic on Intel cpus, because they are 80 bits long.
>
Yup. BTW, structs can be used with atomic ops provided they are simple
enough (no destructors, no postblits, etc.) and fit into a supported size.
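In D the relevant tools live in core.atomic. A toy example on a shared int
(the same calls also accept shared pointers):

    import core.atomic : atomicLoad, atomicStore, cas, MemoryOrder;
    import std.stdio : writeln;

    void main()
    {
        shared int flag = 0;

        // "Ordered atomic": atomic plus sequentially consistent by default.
        atomicStore(flag, 1);
        auto v = atomicLoad!(MemoryOrder.seq)(flag);
        writeln("flag = ", v);

        // Compare-and-swap: succeeds only if flag still holds 1.
        bool swapped = cas(&flag, 1, 2);
        writeln("swapped = ", swapped, ", flag = ", atomicLoad(flag));
    }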

> OTOH, the last bit of TDPL (pg 426-429 of my copy) talks about lock-free
> stacks and lists.  For the mailbox I think a stack would do as well as a
> queue as it's going to be emptied entirely when the cell accesses it,
> though perhaps the list would be better, but the stack is simpler.  I'd
> need to adapt the pop routine into a "popAll", but that looks
> straightforward.

The "looks straight-forward" is almost always wrong with lock-free ;)

Just be careful and set up your design so that you can swap the explosive
& fast lock-free queues for good old lock-based ones.
It's painfully bad to discover memory corruption due to *some* glitch in a
lock-free data structure a week before shipping the app into production
(or even a week after...).
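For example, hide the mailbox behind a small interface so the rest of the
code doesn't care which implementation sits underneath. This is only a
sketch with made-up names; a lock-free version would slot in behind the
same interface:

    import core.sync.mutex : Mutex;

    // The mailbox as a minimal interface: the rest of the code only sees
    // put/takeAll, so the implementation behind it can change freely.
    interface Mailbox(T)
    {
        void put(T msg);
        T[] takeAll();   // drain everything at once ("popAll")
    }

    // Good old lock-based implementation.
    class LockedMailbox(T) : Mailbox!T
    {
        private Mutex mtx;
        private T[] items;

        this() { mtx = new Mutex; }

        void put(T msg)
        {
            mtx.lock();
            scope (exit) mtx.unlock();
            items ~= msg;
        }

        T[] takeAll()
        {
            mtx.lock();
            scope (exit) mtx.unlock();
            auto all = items;
            items = null;   // leave the mailbox empty
            return all;
        }
    }

    void main()
    {
        Mailbox!int box = new LockedMailbox!int;
        box.put(1);
        box.put(2);
        assert(box.takeAll() == [1, 2]);
    }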


-- 
Dmitry Olshansky

