GC hangs on spinlock in ConservativeGC.runLocked

Steven Schveighoffer schveiguy at gmail.com
Mon Feb 20 22:11:47 UTC 2023


On 2/20/23 4:28 PM, klasbo wrote:
> I don't quite understand the internals of the GC well enough to "get" 
> what happens here. Is it that an assert in sweep() triggers, and this[1] 
> scope(failure) in fullcollect() would re-trigger a GC collection when it 
> tries to allocate the trace info (which is now fixed when trace info is 
> nogc)?

So it was quite a subtle bug.

The throw handler will look at the `info` member of the Throwable (the 
trace info), and if it is null, will attempt to allocate a trace info.

If you threw inside the GC (i.e. when the GC is collecting), this means 
you *cannot* allocate. The trace info allocator sneakily was using the 
GC, and so it had a case where it checked the `inFinalizer` flag, and if 
that was true, it would simply not allocate traceinfo (leave it at null).

In the GC collection routine, there is a mechanism that catches any 
`Exception` and handles it properly by throwing a FinalizerError 
(this has its `info` set to `SuppressTraceInfo`, which prevents further 
traceinfo allocations).

But `Error` is *not* caught. It leaks all the way out to a 
`scope(failure)` statement.

Now, inside this `scope(failure)` statement, the `inFinalizer` flag is 
set to false. What does this do?

a. Catches the Error (which, remember, still has the null trace info).
b. Sets the flag to false.
c. *Rethrows the Error*. The throw handler sees a null traceinfo and 
tries to allocate.
d. Allocating tries to take the spinlock, which is still locked, and 
deadlocks.

The whole reason we don't allow allocating inside the finalizer is 
because of this deadlock!
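
To make the sequence concrete, here is a rough C model of it. This is 
only a sketch, not druntime code: `gc_lock`, `in_finalizer`, 
`alloc_trace_info` and `simulate_collection` are hypothetical stand-ins, 
and the lock is only *tried*, so the model reports "would deadlock" 
where the real spinlock would spin forever.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the runtime state described above. */
static atomic_flag gc_lock = ATOMIC_FLAG_INIT; /* models the GC spinlock */
static bool in_finalizer = false;              /* models the inFinalizer flag */

typedef enum {
    TRACE_SKIPPED,        /* flag was set, info left at null */
    TRACE_ALLOCATED,      /* lock was free, allocation succeeded */
    TRACE_WOULD_DEADLOCK  /* lock already held; the real code spins here */
} trace_result;

/* Models the old, GC-backed trace info allocator. */
static trace_result alloc_trace_info(void)
{
    if (in_finalizer)
        return TRACE_SKIPPED;
    if (atomic_flag_test_and_set(&gc_lock))
        return TRACE_WOULD_DEADLOCK;
    atomic_flag_clear(&gc_lock);
    return TRACE_ALLOCATED;
}

/* Models a collection during which an Error is thrown, then rethrown. */
static void simulate_collection(trace_result *first_throw,
                                trace_result *rethrow)
{
    atomic_flag_test_and_set(&gc_lock); /* the collection holds the lock */
    in_finalizer = true;

    *first_throw = alloc_trace_info();  /* Error thrown: info stays null */

    in_finalizer = false;               /* what the scope(failure) does */
    *rethrow = alloc_trace_info();      /* rethrow: null info, so allocate */

    atomic_flag_clear(&gc_lock);        /* only reachable in this model */
}
```

In this model the first throw is skipped because the flag is set, and 
the rethrow is the one that hits the held lock.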

The solution is to allocate the trace info with C `malloc`, and free it 
with `free`. The trace info was always simply a list of stack frame 
addresses, and so trivial to allocate/free.
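
As an illustration of why this is trivial, a trace info that is just a 
list of addresses can be managed with nothing but `malloc` and `free`. 
The sketch below is an assumption about the general shape, not the 
actual druntime change; the `trace_info` struct and both function names 
are made up for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: a counted array of stack frame addresses. */
typedef struct {
    size_t count;
    void **frames;
} trace_info;

/* Copies `count` frame addresses into malloc'd storage.
   Never touches the GC, so it is safe while the GC lock is held.
   Returns NULL if allocation fails. */
static trace_info *trace_info_create(void *const *frames, size_t count)
{
    trace_info *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;

    t->count = 0;
    t->frames = NULL;
    if (count > 0) {
        t->frames = malloc(count * sizeof *t->frames);
        if (t->frames == NULL) {
            free(t);
            return NULL;
        }
        memcpy(t->frames, frames, count * sizeof *t->frames);
        t->count = count;
    }
    return t;
}

/* Releases a trace info made by trace_info_create; NULL is a no-op. */
static void trace_info_destroy(trace_info *t)
{
    if (t != NULL) {
        free(t->frames);
        free(t);
    }
}
```

The cost is that whoever owns the Throwable now has to free the trace 
info explicitly, since the GC no longer does it.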

> Or the more important (for me) question: Is this part of "normal" GC 
> control flow (why would assert(freedPages < usedPages) trigger? This is 
> the beyond the limit of my GC understanding!), or is there still/also 
> something broken on my end that I have to look for?

The error itself is something that isn't addressed with this fix. So I 
should clarify that the *deadlock* was fixed, not your original root 
cause. Please try with the latest compiler, and elaborate further if you 
still can't figure it out and/or file a bug!

-Steve


More information about the Digitalmars-d mailing list