GC hangs on spinlock in ConservativeGC.runLocked
Steven Schveighoffer
schveiguy at gmail.com
Mon Feb 20 22:11:47 UTC 2023
On 2/20/23 4:28 PM, klasbo wrote:
> I don't quite understand the internals of the GC well enough to "get"
> what happens here. Is it that an assert in sweep() triggers, and this[1]
> scope(failure) in fullcollect() would re-trigger a GC collection when it
> tries to allocate the trace info (which is now fixed when trace info is
> nogc)?
So it was quite a subtle bug.
The throw handler will look at the `info` member of the Throwable (the
trace info), and if it is null, will attempt to allocate a trace info.
If you threw inside the GC (i.e. when the GC is collecting), this means
you *cannot* allocate. The trace info allocator sneakily was using the
GC, and so it had a case where it checked the `inFinalizer` flag, and if
that was true, it would simply not allocate traceinfo (leave it at null).
In the GC collection routine, there is a mechanism that catches
`Exception`, and if so, handles it properly by throwing a FinalizerError
(this has its `info` set to `SuppressTraceInfo`, which prevents further
traceinfo allocations).
But `Error` is *not* caught. It leaks all the way out to a
`scope(failure)` statement.
Now, inside this `scope(failure)` statement, the `inFinalizer` flag is
set to false. What does this do?
a. catches the Error (which remember, still has the null trace info)
b. Sets the flag to false
c. *rethrows the Error*. This sees a null traceinfo, and tries to allocate
d. Allocating tries to take the spinlock, which is still locked, and
deadlocks.
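
To make steps a-d concrete, here is a rough C model of the sequence. It is
not druntime code: `gc_lock`, `in_finalizer`, `alloc_trace_info` and
`collect_with_error` are made-up names, and the lock is only *tried* once
(instead of spun on) so the sketch reports the deadlock rather than hanging.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the GC state; these are not druntime names. */
static atomic_flag gc_lock = ATOMIC_FLAG_INIT;
static bool in_finalizer = false;

/* Single attempt at the lock. A real spinlock would loop here forever. */
static bool try_lock(void) { return !atomic_flag_test_and_set(&gc_lock); }
static void unlock(void)   { atomic_flag_clear(&gc_lock); }

/* Models a trace info allocator that goes through the GC.
 * Returns  1 if it allocated,
 *          0 if it skipped because in_finalizer was set,
 *         -1 if the lock was already held (the real code would spin forever). */
static int alloc_trace_info(void)
{
    if (in_finalizer)
        return 0;
    if (!try_lock())
        return -1;
    unlock();
    return 1;
}

typedef struct {
    int at_throw;   /* result when the Error is first thrown */
    int at_rethrow; /* result when scope(failure) rethrows it */
} Outcome;

/* Models a collection in which an Error is thrown while the lock is held. */
static Outcome collect_with_error(void)
{
    Outcome o;
    bool locked = try_lock();        /* the collection owns the GC lock */
    assert(locked);
    (void)locked;
    in_finalizer = true;

    o.at_throw = alloc_trace_info(); /* flag is set: trace info stays null */

    in_finalizer = false;            /* step b: scope(failure) clears the flag */
    o.at_rethrow = alloc_trace_info(); /* steps c and d: lock is still held */

    unlock();
    return o;
}
```

Calling `collect_with_error()` gives `at_throw == 0` (allocation skipped)
and `at_rethrow == -1`, which is the point where the real code spins on the
lock it already holds.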
The whole reason we don't allow allocating inside the finalizer is
because of this deadlock!
The solution is to allocate the trace info with C `malloc`, and free it
with `free`. The trace info was always simply a list of stack frame
addresses, and so trivial to allocate/free.
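
As an illustration of why this is trivial, a trace info that is just a list
of stack frame addresses can be modelled in C roughly like this. The
`TraceInfo` struct and the `trace_info_create`/`trace_info_destroy` helpers
are hypothetical names for this sketch, not what druntime actually uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: a count plus a malloc'd array of return addresses. */
typedef struct {
    size_t     count;
    uintptr_t *frames;
} TraceInfo;

/* Copies `count` addresses into memory obtained from malloc.
 * Never touches a GC, so it is safe to call while a GC lock is held.
 * Returns NULL if memory could not be obtained. */
static TraceInfo *trace_info_create(const uintptr_t *addrs, size_t count)
{
    assert(addrs != NULL || count == 0);

    TraceInfo *ti = malloc(sizeof *ti);
    if (ti == NULL)
        return NULL;

    ti->count = count;
    ti->frames = NULL;
    if (count > 0) {
        ti->frames = malloc(count * sizeof *ti->frames);
        if (ti->frames == NULL) {
            free(ti);
            return NULL;
        }
        memcpy(ti->frames, addrs, count * sizeof *ti->frames);
    }
    return ti;
}

/* Releases everything trace_info_create allocated. NULL is accepted. */
static void trace_info_destroy(TraceInfo *ti)
{
    if (ti == NULL)
        return;
    free(ti->frames);
    free(ti);
}
```

Since nothing here goes through the GC, creating the trace info cannot
re-enter the collector or touch its lock.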
> Or the more important (for me) question: Is this part of "normal" GC
> control flow (why would assert(freedPages < usedPages) trigger? This is
> the beyond the limit of my GC understanding!), or is there still/also
> something broken on my end that I have to look for?
The error itself is something that isn't addressed with this fix. So I
should clarify that the *deadlock* was fixed, not your original root
cause. Please try with the latest compiler, and elaborate further if you
still can't figure it out and/or file a bug!
-Steve