[dmd-concurrency] D's Memory Model

Robert Jacques sandford at jhu.edu
Tue Feb 9 18:14:53 PST 2010


On Tue, 09 Feb 2010 05:12:29 -0500, Fawzi Mohamed <fawzi at gmx.ch> wrote:
> On 8-feb-10, at 19:15, Robert Jacques wrote:
>> On Mon, 08 Feb 2010 06:07:36 -0500, Fawzi Mohamed <fawzi at gmx.ch> wrote:

[snip]

>> Concurrent collectors trade throughput (GC time / program time) for  
>> program responsiveness. They also often trade program correctness for  
>> latency, as they are notoriously hard to get correct for languages with  
>> mutation. While there are some very elegant concurrent GCs for  
>> functional languages, the high-level descriptions of the Boehm  
>> concurrent GC seem to be inherently racy, while Java concurrent GCs  
>> tend to be highly instrumented and lock-tastic (mark-bit locking on  
>> every ref assignment).
> that can be done with atomic ops...

It can only be done correctly using a DCAS (not a double-width CAS), so it  
isn't implementable on any mainstream chip. So you have to lock the mark  
bits on an object. This can be done as a spin-lock using atomic ops, but  
that doesn't change the number or frequency of the locks that have to be  
taken.

[snip]

>> I'd disagree in that memory partitioning should be equally difficult on  
>> either 32-bit or 64-bit platforms. Consider memory pages A B C. If  
>> thread 1 allocates some memory and gets page A, then thread 2 allocates  
>> some memory and gets page B, when thread 1 next allocates memory it'll  
>> get page C and have a discontinuous memory set (A C).
> you can use mmap or similar to reserve big ranges of memory that you  
> might never actually use (commit), so yes, there is a difference. On  
> 32-bit you cannot afford to reserve large chunks of memory just to  
> quickly know which thread owns that memory.

Interesting, though the fact that this makes each pool finite would appear  
to be a design flaw to me.

>>> In general one would not expect heavy sharing between threads so for  
>>> example an explicit list of external pointers could be kept if smaller  
>>> than X.
>>
>> Do you know an example of this? Given the Java thread-local collection  
>> papers, this doesn't seem to be done in practice. Also, it seems very  
>> overhead-intensive and prone to races.
>
> well not directly, but I know that for example  
> http://www.research.ibm.com/people/d/dfb/papers/Bacon03Pure.ps  keeps a  
> list of changes on a per thread basis. That example actually keeps a  
> reference count instead of a simple mark flag to avoid having to keep  
> the "externally marked" list explicitly. Probably a better approach if  
> one goes in that direction.

Ah, that sounds more reasonable, and an interesting hybrid approach to  
boot. Although, to the best of my knowledge, ref-counting GCs are not 100%  
correct given the multi-paradigm nature of the D programming language. (To  
say nothing of efficiency.)

[snip]

>>> This is an advantage, but using stack variables, scope variables, good  
>>> escape analysis and automatic collection for what does not escape, a  
>>> good concurrent gc,... the gain offered by shared is not so clear cut  
>>> to me, seeing its complexity.
>>
>> First, full escape analysis is both difficult and slow in interpreted  
>> languages and impossible in compiled languages such as D. So automatic  
>> collection is out too.
>> Second, scope/stack variables are limited in use and size; for example,  
>> you can't define trees or dynamic arrays on the stack, nor exceed the  
>> stack size (typically 1 MB).
> sure, those things alone don't solve all problems, but is what doesn't  
> fall into them so much overhead, compared with "normal" shared  
> allocation pools and a parallel or concurrent GC, as to be worthwhile?

I assume you're talking about programmer and not implementation overhead?  
(Since the implementation overhead is definitely lower) Apple seems to  
think so. Microsoft and Intel seem to think so too. Pretty much everybody  
has basically agreed that the "normal" way has failed and we need a  
different solution. Just that nobody's sure what those solutions should  
be, so we're getting lots of different answers: lib-dispatch, OpenCL,  
CUDA, TBB, PLINQ, Erlang, etc.

>> Third, thread-local mark-sweep GCs actually equal or outperform modern  
>> concurrent-parallel-generational GCs (according to Apple)
> I fully believe this, I don't contest that purely local mark and sweep  
> is more efficient, but is it that much more efficient to justify the  
> extra complexity in the language?
>
>> , and that's in a language which supports them. Further, concurrent  
>> collectors need to access the mark bits of an object. In D this means  
>> taking the global GC lock on _every_ reference assignment and  
>> traversing a log N memory tree.
> setting the mark bits is a prime example of something that can be done  
> with an atomic op.
> I haven't understood what you mean by "taking the global GC lock on  
> _every_ reference assignment and traversing a log N memory tree": you  
> can find the root pointer and check the mark bit without a lock, and  
> only when you think you need to set it do you need an atomic op (during  
> the mark phase you can just go unmarked->marked).
> You mean the operations that are needed by a concurrent GC to ensure  
> that modifications don't make the GC "lose" objects that are in use, if  
> they are modified while collecting?
> So basically the advantage of introducing shared is that concurrent GC  
> become cheaper and more feasible because the overhead of the mutation is  
> paid only by shared objects?

Okay, the problem I think you're having is that you're thinking of Java  
and not of a systems programming language like C, C++ or D. In Java, all  
heap memory items are objects and there are no interior pointers, so it is  
trivial to find the mark bits of a class via its metadata. You still  
can't update the mark bits atomically, since you need a DCAS to store the  
reference while making sure the mark bits are still what you think they  
are. So you have to lock-update-unlock. Anyway, in C/C++/D you have  
structs, interior pointers and arrays, which all mean that to find those  
pointer bits you have to traverse the memory tree inside the GC. So you  
need to lock it. By the way, the same thing goes for capacity, which is  
why the lookup cache was added to enhance array append performance.
And you're right about shared meaning that you only have to pay for the  
concurrency overhead where it's really needed.

[snip]

>> Fourth, Bearophile brought this up on the newsgroup; one of the biggest  
>> mistakes of Java is considered to be that every object is also a  
>> monitor. Without shared, D would have to make the same mistake.
> why? explicit locks are not an option? I use mutex,... quite often...

There are several good posts in the mailing list (both this and the main D  
one), as well as on the web in general, on the fallacy of using raw locks;  
monitors are known to be much safer. I think the point of the article was  
not that monitors were used to protect objects, but that _all_ objects  
were protected by monitors, which requires a bunch of excess overhead when  
in most cases it's not needed. It also leads people to share any old  
class, even if that's logically incorrect.

>>> Page reuse between different local pools is indeed an issue, I would  
>>> say that with 64 bits one can simply give empty pages back to the  
>>> system if they are over a given threshold, whereas with 32 bit indeed  
>>> some more tricks (stealing dirty pages) might be in order. Please note  
>>> that stealing pages from another numa node is not something you want  
>>> to do unless it is unavoidable
>>
>> I don't see what 64-bit vs 32-bit has to do with this; the former  
>> simply lets you access more memory, it doesn't magically prevent page  
>> fragmentation, false sharing, etc.
> yes but I don't expect page fragmentation to be such a problem that you  
> need to steal dirty pages from other pools, unless memory is really  
> tight.
> false sharing is not an issue if you don't steal pages.

False sharing of local objects isn't, but false sharing of  
shared/immutable objects still is. And memory can be just as tight on a  
64-bit system as a 32-bit system; one just has the ability to see more of  
it.


