GC conservatism -- again

Wed Dec 29 12:43:28 PST 2010

On Wed, 29 Dec 2010 12:27:02 -0700, Steven Schveighoffer  
<schveiguy at yahoo.com> wrote:
> On Wed, 29 Dec 2010 14:00:17 -0500, Robert Jacques <sandford at jhu.edu>  
> wrote:
>> On Wed, 29 Dec 2010 07:37:10 -0700, Steven Schveighoffer  
>> <schveiguy at yahoo.com> wrote:
>>> On Tue, 28 Dec 2010 01:23:22 -0500, Robert Jacques <sandford at jhu.edu>  
>>> wrote:
>>>> Second, the false pointer problem disappears (for practical purposes)  
>>>> when you move to 64-bit.
>>>
>>> I'm not sure I like this "solution", but you are correct.  This is  
>>> somewhat mitigated however by the way memory is allocated (I'm  
>>> assuming not sparsely throughout the address space, and also low in  
>>> the address space).  It certainly makes it less likely that a 64-bit  
>>> random long points at data, but it's not inconceivable to have 32-bits  
>>> of 0 interspersed with non-zero data.  It might be likely to have a  
>>> struct with two ints back to back, where one int is frequently 0.
>>
>> Ah, but the GC can allocate ram in any section of the address space it  
>> wants, so it would be easy for the upper 32-bits to be always non-zero  
>> by design.
>
> huh?  How does the GC control whether you set one int to 0 and the other  
> not?

Hmm, perhaps an example would be best. Consider a 64-bit thread-local GC.  
It might allocate ram in the following pattern:
[2-bit Shared/Local][16-bit thread ID][6-bit tag][40-bit pointer]
So first, the address space is divided up into 4 regions, [11...] and  
[00...] are left free for user use (external allocation, memory-mapped  
files, etc), and [01...]/[10...] are used to denote shared/thread-local.
Next you have the thread ID, which is one way to support thread-local  
allocation/thread-local GCs.
Then you might have a 6-bit region that lock-free algorithms could use as  
a tag (and would be ignored by the shared GC), or local GCs could use for  
internal purposes.
Finally, you have a 40-bit region with what we normally think of as a  
'pointer'.

The thing to understand, is that 64-bit computers are really 40-bit  
computers, currently. And given that 40-bits will hold us until we get to  
peta-bytes, we should have 24-bits to add meta-info to our pointers for  
some time to come. So, as long as we choose this meta-info carefully, we  
can avoid common bit patterns and the associated false pointers.

>>
>>>> Third, modern GCs (i.e. thread-local GCs) can further reduce the  
>>>> false pointer issue.
>>>
>>> I'd rather have precise scanning :)  There are issues with  
>>> thread-local GCs.
>>
>> The only issue with thread-local GCs is that you can't cast to  
>> immutable and then shared the result across threads. And eventually,  
>> well have a) better ways of constructing immutable and b) a deep idup,  
>> to mitigate this.
>
> I'm talking about collection cycles -- they necessarily need to scan  
> both thread local and shared heaps because of the possibility of  
> cross-heap pointing.  Which means you gain very little for thread-local  
> heaps.  The only thing it buys you is you can assume the shared heap has  
> no pointers into local heaps, and local heaps have no pointers into  
> other local heaps.  But if you can't run a collection on just one heap,  
> there is no point in separating them.

The defining feature of thread-local GCs is that they _can_ run collection  
cycles independently from each other.

> Plus it's easy to cast without moving the data, which is not undefined  
> currently if you take the necessary precautions, but would cause large  
> problems with separate heaps.

Casting from immutable/shared would be memory valid, it's casting to  
immutable and shared where movement has to occur. As casting between  
immutable/shared and  mutable is logically invalid, I'd expect that these  
cases would be rare (once we solve the problem of safely constructing  
complex immutable data)

>>
>>> If we have the typeinfo of a memory block for the GC to parse, you can  
>>> also rule out cross-thread pointers without thread-local GCs (for  
>>> unshared data).
>>
>> Which basically creates all the problems of thread-local GC, with very  
>> few of the advantages.
>
> What are the advantages?  I'm not being sarcastic, I really don't know.
>
> -Steve

The major advantage is that they match or out-perform the modern  
generational/concurrent GCs, even when backed by conservative mark-sweep  
collectors (according to Apple). (TLGCs can be backed by modern  
collectors) Essentially, TLGCs work by separately allocating and  
collecting objects which are not-shared between threads. Since these  
collectors don't have to be thread safe, they can be more efficient in  
their implementation, and collections only have to deal with their subset  
of the heap and their own stack. This reduces pause times and false  
pointers, etc. TLGCs are inherently parallel and have interleaved pause  
times, which can greatly reduce the "embarrassing pause" effect (i.e. your  
worker thread running out of ram won't pause your GUI, or stop other  
workers from fulfilling http requests).

In D, they become even more important, since to the best of my knowledge D  
can never support generational GCs and only supports concurrent GCs on  
certain OSes. So, if D wants a competitive GC, a thread-local GC is the  
only cross-platform option I know of. (C function calls prevent most  
generational/concurrent GC algorithms from being applicable to D)