Why is GC.collect `pure`

Wed Aug 2 20:21:30 UTC 2023

On Wednesday, August 2, 2023 12:02:35 PM MDT Nick Treleaven via Digitalmars-d-
learn wrote:
> On Wednesday, 2 August 2023 at 17:55:12 UTC, Nick Treleaven wrote:
> > On Wednesday, 2 August 2023 at 17:52:00 UTC, Nick Treleaven
> >
> > wrote:
> >> Now I'm wondering why those functions are marked `pure` - they
> >> must affect the GC's bookkeeping state.
>
> I guess it was because the GC's internal state is not supposed to
> be observable outside internal GC functions. I find it harder to
> accept some of those than `GC.malloc` being pure, because
> GC.disable and GC.enable will affect how long future allocations
> will take. That latency can be significant and observed by the
> program. Also conceptually they are changing GC state.

Well, affecting how long something takes doesn't have anything to do with
pure. Another process running on the box could have the same effect. Whether
a function can be pure or not is strictly a matter of whether it's possible
for it to access any non-immutable data that wasn't passed to it via its
arguments.

Of course, the whole question of pure gets weird with the GC, because we
want to be able to treat GC allocations as pure when in fact they do mutate
the GC's state when the GC was not passed in via a function argument. So,
when dealing with the GC, purity becomes a question of maintaining the
guarantees that the compiler expects with regards to pure and allocations
rather than the more straightforward question of whether the function can
access any non-immutable data from anything outside of itself via anything
other than its arguments.

In general, the GC's state is essentially treated as being separate from
that of the program itself and thus irrelevant to stuff like pure. As far as
the state of the program itself is concerned, the GC could allocate with new
and then never bother to free anything, or it could be running a collection
every single time new is called - or anything in between. As far as D is
concerned, none of that matters to the state of the actual program. It's
just a GC concern.

That being said, of course, we do need to be careful when dealing directly
with GC functions, because we don't want what the compiler does based on
pure to end up having undesirable side effects with regards to the GC. As
such, whenever deciding whether such functions can be pure or not, we need
to carefully consider what the compiler will potentially do based on pure.

So, remember that the most that the compiler will do with pure is optimize
out multiple calls to the same strongly pure function within a single
expression where each call has the exact same arguments. The compiler will
also use that information to determine whether a value might be unique or
not so that it can determine whether it's safe to convert mutable data to
immutable, but that's primarily a type system concern rather than a runtime
one. As such, the question of whether it's safe to make a GC function pure
essentially comes down to the question of what would happen if you have an
expression such as

foo(12) * foo(12)

which ends up being optimized down to one call to foo instead of two,
because foo is strongly pure. And remember that because foo is strongly
pure, its arguments are immutable (or were implicitly converted to
immutable), and thus its execution in both cases would be identical. So, the
exact same sequence of calls to GC functions would occur in each call to
foo. So, we don't have to worry about something like the GC being enabled in
one call to foo but disabled in the other.

The functions that you referred to were GC.collect, GC. minimize, GC.enable,
and GC.disable. So, the question becomes how (if at all) it affects the
state of the program itself if the number of calls to those functions
changes due to a call to foo being optimized out. And it shouldn't take much
to see that it doesn't matter.

Calling enable or disable multiple times in a row would just result in
extraneous calls that do nothing, so optimizing that down to a single call
wouldn't matter. Calling minimize multiple times would similarly not matter
at all. It's highly unlikely that multiple calls to minimize within a short
period of time would make any difference over a single call, and even if it
did, it would just be affecting how much free memory the GC had. It would
have no effect on the state of the program itself (and remember that as far
as the rest of the program is concerned, the state of the GC doesn't even
exist; the program's semantics would be the same even if new always grabbed
more memory from the OS, and collections never did anything).

Now, GC.collect is a bigger question, because that can affect when objects
are actually destroyed, which obviously can affect the state of the program
outside of the GC based on what the destructors involved do. So, that _can_
affect the program outside of memory allocations. However, when that happens
is already effectively random, and it isn't even guaranteed that it will
ever happen. It could happen on an explicit call to GC.collect, or it could
happen with a stray call to new, or the program might shut down before the
GC ever destroys any of the unreferenced objects sitting on its heap as part
of a collection.

So, whether either of those calls to foo results in an object being
destroyed really doesn't matter. The program's logic cannot reasonably rely
on objects on the GC heap ever being destroyed, because the GC doesn't
guarantee that it will ever collect that memory and destroy the objects in
question. As such, whether GC.collect gets called once or twice within that
expression is irrelevant. It _could_ have an effect on what the program
actually does, just like whether the GC runs out of memory could affect what
the program actually does, but given what the language guarantees (or
doesn't guarantee) to the program with regards to the GC, it's perfectly
fine for calls to GC.collect to be optimized out like this. And
realistically, given that they'd be occurring right next to each other, the
additional call to GC.collect would likely just slow the program down
without freeing any additional memory or destroying any additional objects
anyway. But either way, it doesn't violate the guarantees given by the
language.

So, while it does feel weird to allow pure functions to effectively mutate
global state via the GC, when you consider how the language treats the GC's
state and what the compiler actually does with pure, it should be pretty
clear that treating these functions as pure shouldn't be a problem.

- Jonathan M Davis