collectNoStack should be axed
Steven Schveighoffer
schveiguy at gmail.com
Sun Apr 21 19:28:11 UTC 2024
So I am helping to add a new GC to dlang, and one thing I have
run into is that all correctly-written GC implementations must
implement the function `collectNoStack`.
This is defined in the GC interface here:
https://github.com/dlang/dmd/blob/c09adbbc2793aedcc3569681acfc42260d3b0e4b/druntime/src/core/gc/gcinterface.d#L59
When looking further into what this actually means, alarmingly,
it means exactly what it says -- run a collection cycle without
examining *any* thread stacks for roots.
What in God's name is the point of this? Won't this just collect
things that are still *actively being referenced by threads*?!
The answer is -- yes.
Well, I wanted to know more about how this could be valid, so I
did more research and it's kind of a fun story. Some of this is
conjecture as I wasn't around for the beginnings of this, and
having help filling in the holes is appreciated.
In looking to see which code actually calls this, I can find only
one use, here:
https://github.com/dlang/dmd/blob/c09adbbc2793aedcc3569681acfc42260d3b0e4b/druntime/src/core/internal/gc/proxy.d#L119
Quoting the code in that snippet, so you can keep the
"explanatory comment" in mind:
```d
// NOTE: There may be daemons threads still running when this routine is
//       called. If so, cleaning memory out from under then is a good
//       way to make them crash horribly. This probably doesn't matter
//       much since the app is supposed to be shutting down anyway, but
//       I'm disabling cleanup for now until I can think about it some
//       more.
//
// NOTE: Due to popular demand, this has been re-enabled. It still has
//       the problems mentioned above though, so I guess we'll see.
instance.collectNoStack(); // not really a 'collect all' -- still scans
                           // static data area, roots, and ranges.
```
This is the "collect" configuration for how to terminate the GC.
In a way, this makes sense, because if you are terminating the
GC, the GC is going away, and it doesn't really matter if
anything is referring to the data, those references are all gonna
die.
OK.... but why skip just the thread stacks? In fact, why scan
*anything at all*? I'm not the first one to think this: there's a
second configuration that does exactly this, in another case of
that switch.
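
A minimal sketch of how a program could pick between these shutdown configurations. It assumes the druntime GC option is spelled `cleanup` and accepts `none`, `collect`, or `finalize`; that spelling is not stated in this post, so check it against the GC configuration docs for your druntime release.

```d
import core.stdc.stdio : printf;

// Assumption: "cleanup" accepts none|collect|finalize. This spelling is
// not taken from the post; verify it for your druntime release.
extern(C) __gshared string[] rt_options = ["gcopt=cleanup:finalize"];

struct S
{
    ~this()
    {
        printf("Destroying!\n");
    }
}

void main()
{
    // Left alive on purpose; what happens to it at shutdown depends on
    // the cleanup mode selected above.
    auto s = new S;
}
```

Under the same assumption, the option can also be given at launch, e.g. `--DRT-gcopt=cleanup:none`, without recompiling.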
To try and pin down why this is there, and what the "popular
demand" note means, I started using git blame (I have to say, the
world is a better place with git and github around, I shudder to
think how I would have had to find the history of this with
subversion).
Aaaaand I traced it back to the *beginning of druntime*. Yes,
this is the repository after the very first commit from Sean
Kelly for druntime:
https://github.com/dlang/dmd/blob/6837c0cd426f7e828aec1a2bdc941ac9b722dd14/src/gc/basic/gc.d#L73
So, I thought, maybe I will email Sean? He might know why this
note is there.
But wait! druntime takes its lineage from Tango! And Tango is
also on github >:)
And now, we find out when the first note was written:
https://github.com/SiegeLord/Tango-D2/commit/03ea5067558829b8c99e3cf12bb0e55c43e29269
Hoooold on a second. The line that was commented out was... not
the full collect. That was *already commented out*, and actually,
it was just doing what I proposed above -- collecting all blocks
regardless of roots.
The note was added when *that* was commented out, and apparently,
the Tango runtime just didn't do any collection at the end of a
program.
What about the second note? That got added "by popular demand"
later:
https://github.com/SiegeLord/Tango-D2/commit/5984ec967eaffb1d3c1c7504e9349f18c8b36038
This means, the `_fullCollectNoStack` was added back in (and
apparently the second call to run the destructors and clean all
garbage, which must have been separated back then). I can only guess
this was because people thought it should be done.
The note concerns "daemon threads". What is a daemon thread? It's
a thread that does not get joined at the end of execution (that
is still the same, and you can see the explanation here:
https://dlang.org/phobos/core_thread_threadbase.html#.ThreadBase.isDaemon). I checked, and literally this is the only place the `isDaemon` flag is used. Daemon threads still are stopped for GC, and still get scanned. They just aren't waited for at the end of main.
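
To make the daemon-thread situation concrete, here is a small sketch using `core.thread`. The helper name `spawnDaemon` is hypothetical and the sleep loop is only a stand-in for real background work; the point is just that the runtime does not wait for such a thread when `main` returns.

```d
import core.thread : Thread;
import core.time : msecs;
import core.stdc.stdio : printf;

// Hypothetical helper for illustration: starts a thread that the runtime
// will not join when main returns.
Thread spawnDaemon()
{
    auto t = new Thread({
        while (true)
            Thread.sleep(10.msecs); // stand-in for background work
    });
    t.isDaemon = true; // mark it so shutdown does not wait for it
    t.start();
    return t;
}

void main()
{
    auto t = spawnDaemon();
    printf("isDaemon: %d\n", t.isDaemon ? 1 : 0);
    // main returns here; the runtime shuts down without waiting for t,
    // so t may still be running while the GC is being terminated.
}
```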
OK, now the note actually makes sense -- if you clean all the
garbage at the end of main *without scanning thread stacks*, then
you clean out memory that the daemon threads may still be using.
But.. does it? When did this *ever work*? Isn't the GC going away?
I wanted to find out the true etymology of this... "thing". So I
kept going back. And as it turns out, the `collectNoStack`
function comes from D1! That's right, we still have that to look
at as well:
https://github.com/dlang/phobos/blob/1f763bca8d8db14cd4e7af89b1667569c002361c/internal/gc/gc.d#L171
Hm.. OK, so this is what D always did. But why? I wanted to find
out exactly what happened differently when the fullCollectNoStack
function was called, and I got my answer:
https://github.com/dlang/phobos/blob/1f763bca8d8db14cd4e7af89b1667569c002361c/internal/gc/gcx.d#L1031
Peruse through that file, and you'll see the `nostack` variable
is used in one place:
https://github.com/dlang/phobos/blob/1f763bca8d8db14cd4e7af89b1667569c002361c/internal/gc/gcx.d#L2030
And look there... it's only skipping the stack scanning *if there
is exactly one thread*.
In other words, with D1, where this poorly named
`fullCollectNoStack` function existed, it actually would scan
with stacks as long as you created multiple threads. That is, in
certain (very common) cases, the `fullCollectNoStack` would scan
stacks. Should it have been called
`fullCollectMaybeNoStacksIfSingleThreaded`? I digress...
And in fact, when D1 was compiled in "single threaded mode",
scanning of the thread stack was indeed skipped:
https://github.com/dlang/phobos/blob/1f763bca8d8db14cd4e7af89b1667569c002361c/internal/gc/gcx.d#L2117
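
The behavior described above can be summarized as a tiny decision function. This is a paraphrase of what the post describes, not the D1 source; `scansThreadStacks` and its parameters are made-up names for illustration.

```d
// Hypothetical paraphrase of the behavior described above; this is not
// the D1 source, and all names here are made up for illustration.
bool scansThreadStacks(bool noStack, size_t threadCount, bool multiThreadedBuild)
{
    if (!multiThreadedBuild)
        return !noStack;                   // single-threaded build: flag is honored
    return !(noStack && threadCount == 1); // otherwise skipped only for exactly one thread
}

void main()
{
    assert(!scansThreadStacks(true, 1, true));  // one thread: stacks skipped
    assert(scansThreadStacks(true, 2, true));   // two threads: stacks scanned anyway
    assert(!scansThreadStacks(true, 1, false)); // single-threaded build: skipped
    assert(scansThreadStacks(false, 1, true));  // normal collection: scanned
}
```

Read this way, the "no stack" flag only ever had an effect when the main thread was the only thread in the process.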
Let's think back to why the heck we have this going on. My theory
is that people who are new to GC or don't really understand how
GC works, run a test like the following:
```d
import core.stdc.stdio : printf;

struct S
{
    ~this()
    {
        printf("Destroying!\n");
    }
}

void main()
{
    S* s = new S; // never explicitly destroyed
}
```
If they don't get a printout, they post an angry/confused message
on the forums saying
### Y U No work GC?
If the stack of `main` is scanned, it's possible there's still a
reference to the `s` there. It could even still be in registers
for the thread. And that might mean that the GC won't clean it up.
The truth is, there is no guarantee any destructors are run. And
especially in 32-bit D (which is what D was exclusively for a
long time), random 32-bit numbers might accidentally "point" at
the memory block.
So maybe, the solution Walter came up with (and I'm just guessing
here), is hey, we are shutting down anyways, just avoid scanning
the main thread stack, and we can satisfy the unwashed masses.
But that brings us back to *WHY THE HELL DO WE STILL HAVE THIS*?
My guess is that the note keeps people from removing it. If we
are doing a scan at all, scanning thread stacks as roots should
be a trivial addition to the scan. Skipping it just adds an extra
layer of complication to the implementation that is unnecessary.
But that note where "I'm disabling cleanup for now until I can
think about it some more" seems to be applying to an actual scan
(not the blunt destruction of all memory, which is the line
commented out when the note was added). That is causing people to
hesitate and leave things be. Someone was behind that "I", and I
probably shouldn't step on that someone's toes; they knew what they
were doing.
And they did, but what they did isn't what the code says (my
hypothesis).
So my solution is, let's just get rid of this extra function.
Let's get rid of any idea of doing a half-ass scan that at best
collects some extra stuff that might not be referenced and at
worst pulls the rug out from still-running threads. And if you
actually called this somehow in the middle of a program, it will
corrupt all your memory immediately.
I did a PR to just see what happens when we do a full scan
instead of the "no stack" scan, and the results are pretty
positive. I'm going to update the PR to really remove all the
tentacles of the "nostack" variable, but I wanted to bring this
story to light because it's too long and bizarre to explain in
the notes of a PR.
https://github.com/dlang/dmd/pull/16401
If there are any good reasons why we should have this, or I got
something wrong, please let me know!
-Steve