collectNoStack should be axed

Steven Schveighoffer schveiguy at gmail.com
Sun Apr 21 19:28:11 UTC 2024


So I am helping to add a new GC to dlang, and one thing I have 
run into is that all correctly-written GC implementations must 
implement the function `collectNoStack`.

This is defined in the GC interface here: 
https://github.com/dlang/dmd/blob/c09adbbc2793aedcc3569681acfc42260d3b0e4b/druntime/src/core/gc/gcinterface.d#L59

When looking further into what this actually means, alarmingly, 
it means exactly what it says -- run a collection cycle without 
examining *any* thread stacks for roots.

What in God's name is the point of this? Won't this just collect 
things that are still *actively being referenced by threads*?! 
The answer is -- yes.

Well, I wanted to know more about how this could be valid, so I 
did more research and it's kind of a fun story. Some of this is 
conjecture as I wasn't around for the beginnings of this, and 
having help filling in the holes is appreciated.

In looking to see which code actually calls this, I can find only 
one use, here: 
https://github.com/dlang/dmd/blob/c09adbbc2793aedcc3569681acfc42260d3b0e4b/druntime/src/core/internal/gc/proxy.d#L119

Quoting the code in that snippet, so you can keep the 
"explanatory comment" in mind:

```d
// NOTE: There may be daemons threads still running when this routine is
//       called.  If so, cleaning memory out from under then is a good
//       way to make them crash horribly.  This probably doesn't matter
//       much since the app is supposed to be shutting down anyway, but
//       I'm disabling cleanup for now until I can think about it some
//       more.
//
// NOTE: Due to popular demand, this has been re-enabled.  It still has
//       the problems mentioned above though, so I guess we'll see.

instance.collectNoStack();  // not really a 'collect all' -- still scans
                            // static data area, roots, and ranges.
```

This is the "collect" configuration for terminating the GC.

In a way, this makes sense, because if you are terminating the 
GC, the GC is going away, and it doesn't really matter if 
anything is referring to the data, those references are all gonna 
die.

OK.... but why skip just the thread stacks? In fact, why scan 
*anything at all*? I'm not the first one to think this, there's a 
second configuration, which does exactly this, which is in 
another case of that switch.
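
For anyone who wants to compare the cases of that switch, here is a minimal sketch. It assumes the druntime GC option `cleanup` (documented values `none`, `collect`, `finalize`) is still spelled this way in your druntime version; the struct `S` is purely illustrative.

```d
// Assumption: "gcopt=cleanup:<mode>" selects the GC termination behaviour.
// Embedding it in rt_options is equivalent to passing
// --DRT-gcopt=cleanup:finalize on the command line.
extern(C) __gshared string[] rt_options = ["gcopt=cleanup:finalize"];

import core.stdc.stdio : printf;

struct S // illustrative type, not from druntime
{
    ~this()
    {
        printf("Destroying!\n");
    }
}

void main()
{
    // Allocate on the GC heap and never free it explicitly, so whatever
    // happens to it is decided by the cleanup mode at program termination.
    auto s = new S;
}
```

Switching the string to `cleanup:none` or `cleanup:collect` lets you observe the difference between the modes without touching druntime itself.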

To try and pin down why this is there, and what the "popular 
demand" note means, I started using git blame (I have to say, the 
world is a better place with git and github around, I shudder to 
think how I would have had to find the history of this with 
subversion).

Aaaaand I traced it back to the *beginning of druntime*. Yes, 
this is the repository after the very first commit from Sean 
Kelly for druntime: 
https://github.com/dlang/dmd/blob/6837c0cd426f7e828aec1a2bdc941ac9b722dd14/src/gc/basic/gc.d#L73

So, I thought, maybe I will email Sean? He might know why this 
note is there.

But wait! druntime takes its lineage from Tango! And Tango is 
also on github >:)

And now, we find out when the first note was written: 
https://github.com/SiegeLord/Tango-D2/commit/03ea5067558829b8c99e3cf12bb0e55c43e29269

Hoooold on a second. The line that was commented out was... not 
the full collect. That was *already commented out*, and actually, 
it was just doing what I proposed above -- collecting all blocks 
regardless of roots.

The note was added when *that* was commented out, and apparently, 
the Tango runtime just didn't do any collection at the end of a 
program.

What about the second note? That got added "by popular demand" 
later:

https://github.com/SiegeLord/Tango-D2/commit/5984ec967eaffb1d3c1c7504e9349f18c8b36038

This means `_fullCollectNoStack` was added back in (and 
apparently the second call to run the destructors and clean all 
garbage, which must have been separate back then). I can only 
guess it was because people thought it should be done.

The note concerns "daemon threads". What is a daemon thread? It's 
a thread that does not get joined at the end of execution (that 
is still the case today; see the explanation here: 
https://dlang.org/phobos/core_thread_threadbase.html#.ThreadBase.isDaemon). 
I checked, and this is literally the only place the `isDaemon` 
flag is used. Daemon threads are still stopped for GC, and still 
get scanned. They just aren't waited for at the end of main.
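
To make the daemon case concrete, here is a minimal sketch using `core.thread`. The helper `startDaemon` and the worker `spin` are hypothetical names made up for illustration; only `Thread`, `isDaemon`, `start` and `Thread.sleep` are druntime APIs.

```d
import core.thread : Thread;
import core.time : msecs;

// Hypothetical worker: loops forever, so it would block program exit
// if the runtime had to join it.
void spin()
{
    while (true)
        Thread.sleep(10.msecs);
}

// Hypothetical helper: start `fn` on a thread that is not joined when
// main returns.
Thread startDaemon(void function() fn)
{
    auto t = new Thread(fn);
    t.isDaemon = true; // runtime will not wait for this thread at exit
    t.start();
    return t;
}

void main()
{
    auto worker = startDaemon(&spin);
    // main returns here; the runtime shuts down (including GC termination)
    // while `worker` may still be running.
}
```

A thread like `worker` is exactly the kind that could still be touching GC memory when the termination-time collection runs.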

OK, now the note actually makes sense -- if you clean all the 
garbage at the end of main *without scanning thread stacks*, then 
you clean out memory that the daemon threads may still be using.
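
The effect is easy to see in a toy mark-and-sweep model. This is a simplified sketch, not druntime code: `Block`, `collect` and `freedAfter` are invented for illustration, and roots are just indices into a tiny "heap".

```d
import std.stdio : writeln;

// Toy heap block: records which other blocks it points to.
struct Block
{
    size_t[] pointsTo;
    bool marked;
    bool freed;
}

// Mark everything reachable from the chosen roots, then free the rest.
void collect(Block[] heap, const size_t[] staticRoots,
             const size_t[] stackRoots, bool scanStacks)
{
    foreach (ref b; heap)
        b.marked = false;

    size_t[] work = staticRoots.dup;
    if (scanStacks)
        work ~= stackRoots; // thread stacks only count as roots if requested

    while (work.length)
    {
        immutable i = work[$ - 1];
        work = work[0 .. $ - 1];
        if (heap[i].marked)
            continue;
        heap[i].marked = true;
        work ~= heap[i].pointsTo;
    }

    foreach (ref b; heap)
        if (!b.marked)
            b.freed = true;
}

// Block 0 is referenced from static data, block 1 only from a thread stack.
bool[] freedAfter(bool scanStacks)
{
    auto heap = new Block[2];
    size_t[] staticRoots = [0];
    size_t[] stackRoots = [1];
    collect(heap, staticRoots, stackRoots, scanStacks);

    bool[] result;
    foreach (b; heap)
        result ~= b.freed;
    return result;
}

void main()
{
    writeln(freedAfter(false)); // [false, true]: block 1 freed while still referenced
    writeln(freedAfter(true));  // [false, false]
}
```

In the "no stack" run, block 1 is swept even though a (simulated) thread stack still refers to it, which is the situation the note worries about for daemon threads.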

But.. does it? When did this *ever work*? Isn't the GC going away?

I wanted to find out the true etymology of this... "thing". So I 
kept going back. And as it turns out, the `collectNoStack` 
function comes from D1! That's right, we still have that to look 
at as well: 
https://github.com/dlang/phobos/blob/1f763bca8d8db14cd4e7af89b1667569c002361c/internal/gc/gc.d#L171

Hm.. OK, so this is what D always did. But why? I wanted to find 
out exactly what happened differently when the fullCollectNoStack 
function was called, and I got my answer:

https://github.com/dlang/phobos/blob/1f763bca8d8db14cd4e7af89b1667569c002361c/internal/gc/gcx.d#L1031

Peruse through that file, and you'll see the `nostack` variable 
is used in one place: 
https://github.com/dlang/phobos/blob/1f763bca8d8db14cd4e7af89b1667569c002361c/internal/gc/gcx.d#L2030

And look there... it's only skipping the stack scanning *if there 
is exactly one thread*.

In other words, with D1, where this poorly named 
`fullCollectNoStack` function existed, it actually would scan 
with stacks as long as you created multiple threads. That is, in 
certain (very common) cases, the `fullCollectNoStack` would scan 
stacks. Should it have been called 
`fullCollectMaybeNoStacksIfSingleThreaded`? I digress...
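
Boiled down, the behaviour described above amounts to the following predicate. This is a paraphrase of what the linked D1 code appears to do, not a copy of it; the function name and parameters are hypothetical.

```d
// Hypothetical paraphrase: does a collection end up scanning thread stacks?
bool scansThreadStacks(bool noStack, size_t threadCount)
{
    // Stacks are skipped only when "no stack" was requested AND there is
    // exactly one thread; in every other case they are scanned.
    return !(noStack && threadCount == 1);
}

void main()
{
    assert(!scansThreadStacks(true, 1)); // single-threaded: stacks skipped
    assert(scansThreadStacks(true, 2));  // multi-threaded: stacks scanned anyway
    assert(scansThreadStacks(false, 1)); // ordinary collection: always scanned
}
```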

And in fact, when D1 was compiled in "single threaded mode", 
scanning of the thread stack was indeed skipped: 
https://github.com/dlang/phobos/blob/1f763bca8d8db14cd4e7af89b1667569c002361c/internal/gc/gcx.d#L2117

Let's think back to why the heck we have this going on. My theory 
is that people who are new to GC or don't really understand how 
GC works, run a test like the following:

```d
import core.stdc.stdio : printf; // needed for printf

struct S
{
    ~this() {
        printf("Destroying!\n");
    }
}

void main()
{
    S *s = new S;
}
```

If they don't get a printout, they post an angry/confused message 
on the forums saying

### Y U No work GC?

If the stack of `main` is scanned, it's possible there's still a 
reference to the `s` there. It could even still be in registers 
for the thread. And that might mean that the GC won't clean it up.

The truth is, there is no guarantee any destructors are run. And 
especially in 32-bit D (which is what D was exclusively for a 
long time), random 32-bit numbers might accidentally "point" at 
the memory block.

So maybe, the solution Walter came up with (and I'm just guessing 
here), is hey, we are shutting down anyways, just avoid scanning 
the main thread stack, and we can satisfy the unwashed masses.

But that brings us back to *WHY THE HELL DO WE STILL HAVE THIS*? 
My guess is that the note keeps people from removing it. If we 
are doing a scan at all, scanning thread stacks as roots should 
be a trivial addition to the scan. Skipping it just adds an extra 
layer of complication to the implementation that is unnecessary. 
But that note where "I'm disabling cleanup for now until I can 
think about it some more" seems to be applying to an actual scan 
(not the blunt destruction of all memory, which is the line 
commented out when the note was added). That is causing people to 
hesitate and leave things be. Someone was behind that "I", and I 
probably shouldn't step on that someone's toes; they knew what 
they were doing.

And they did, but what they did isn't what the code says (my 
hypothesis).

So my solution is, let's just get rid of this extra function. 
Let's get rid of any idea of doing a half-ass scan that at best 
collects some extra stuff that might not be referenced and at 
worst pulls the rug out from still-running threads. And if you 
actually called this somehow in the middle of a program, it will 
corrupt all your memory immediately.

I did a PR to just see what happens when we do a full scan 
instead of the "no stack" scan, and the results are pretty 
positive. I'm going to update the PR to really remove all the 
tentacles of the "nostack" variable, but I wanted to bring this 
story to light because it's too long and bizarre to explain in 
the notes of a PR.

https://github.com/dlang/dmd/pull/16401

If there are any good reasons why we should have this, or I got 
something wrong, please let me know!

-Steve

