Is garbage detection a thing?
Elronnd
elronnd at elronnd.net
Sun Nov 29 21:08:01 UTC 2020
On Sunday, 29 November 2020 at 16:05:04 UTC, Mark wrote:
> I have no good understanding why "garbage collection" is a big
> thing and why "garbage detection" is no thing (I think so).
Because garbage detection is just as expensive as automatic
garbage collection. So if you're going to do the work of
detecting that something is garbage, it's basically free to
collect it at that point.
> today there exist tools for C++ (allocator APIs for debugging
> purposes, or Address-Sanitizer or maybe also MPX) to just
> detect that your program tried to use a memory address that was
> actually freed and invalidated,
Note that address sanitizer is significantly slower than most
‘real’ GCs (such as those used by Java and other languages).
> why did Java and other languages not stop there but also made a
> system that keeps every address alive as long as it is used?
> Then the debug build would use a virtual machine that uses type
> information from compilation for garbage detection, but not
> garbage collection.
Address sanitizer does exactly what you propose here. The
problem is this:
Testing can only prove the presence of bugs, never their
absence. You may run your C++ program a thousand times with
address sanitizer enabled and get no errors; yet your code may
still be incorrect and contain memory errors. Safety features in
a language--like a GC--prevent an entire class of bugs
definitively.
> One very minor criticism that I have is: With GC there can be
> "semantically old data" (a problematic term, sorry) which is
> still alive and valid, and the language gives me the feeling
> that it is a nice system that way. But the overall behavior
> isn't necessarily very correct, it's just that it is much
> better than a corrupted heap which could lead to everything
> possibly crashing soon.
The distinction here is _reachability_ vs _liveness_.
So, GC theory:
A _graph_ is a type of data structure. Imagine you have a sheet of
paper, and on the sheet of paper you have a bunch of dots. There
are lines connecting some of the dots. In graph theory, the dots
are called nodes, and the lines are edges. We say that nodes A
and B are _connected_ if there is an edge going between them. We
also say that A is _reachable_ from B if either A and B are
connected, or A is connected to some C, where C is reachable from
B. Basically, if you can reach one point from another just by
following lines, then each is reachable from the other.
A _directed_ graph is one in which the edges have directionality.
Imagine the lines have little arrows at the ends. There may be
an edge that goes A -> B; or there may be an edge that goes B ->
A. Or there may be both: A <-> B. (Or they can be unconnected.)
In this case, to reach one node from another, you have to follow
the arrows. So it may be that, starting at A, you can reach B;
but you can't go the other way round.
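A quick sketch of directed reachability in Python (the node
names and `reaches` helper are hypothetical, just for
illustration):

```python
# Hypothetical directed graph: an edge u -> v means you can follow
# the arrow from u to v, but not back.
edges = {"A": ["B"], "B": ["C"], "C": []}

def reaches(start, target, edges, seen=None):
    """True if 'target' is reachable from 'start' by following arrows."""
    if seen is None:
        seen = set()
    if start == target:
        return True
    seen.add(start)
    return any(reaches(n, target, edges, seen)
               for n in edges[start] if n not in seen)

print(reaches("A", "C", edges))  # True: A -> B -> C
print(reaches("C", "A", edges))  # False: no arrows lead back to A
```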
The _heap_ is all the objects you've ever created. This includes
the objects you allocate with 'new', as well as all the objects
you allocate from the stack and all your global variables.
What's interesting is that we can think of the heap as a directed
graph. If object A contains a pointer to object B, we can think
of that the same way as there being an edge going from node A
to node B.
The _root set_ is some relatively small number of heap objects
that are always available. Generally, this is all the global
variables and stack-allocated objects. The name _reachable_ is
given to any object which is reachable from one of the root set.
It is impossible for your program to access an unreachable
object; there's no way to get a pointer to it in the first place.
So it is safe for the GC to free unreachable objects.
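That's exactly what the "mark" half of a mark-and-sweep GC
computes. A toy sketch (the heap contents and names here are
made up; a real GC walks actual pointers, not a dict):

```python
# The heap as a directed graph: keys are objects, values are the
# objects they point to. Anything not reachable from the roots
# is garbage.
heap = {
    "A": ["B"],   # A points to B
    "B": ["C"],   # B points to C
    "C": [],
    "D": ["E"],   # D and E point at each other (a cycle)...
    "E": ["D"],   # ...but nothing in the root set points at them
}
roots = {"A"}     # e.g. globals and stack slots

def mark(roots, heap):
    """Return the set of objects reachable from the root set."""
    reachable = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj not in reachable:
            reachable.add(obj)
            stack.extend(heap[obj])   # follow outgoing pointers
    return reachable

garbage = set(heap) - mark(roots, heap)
print(sorted(garbage))   # ['D', 'E'] -- unreachable, despite the cycle
```

Note that D and E are collected even though they point at each
other, which is why tracing GCs handle cycles that naive
reference counting cannot.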
But we can also add another category of objects: _live_ vs _dead_
objects. Live objects are ones which you're actually going to
access at some point. Dead objects are objects that you're never
going to access again, even if they're reachable. If a GC could
detect which reachable objects were dead, it would be able to be
more efficient and use less memory...hypothetically.
The reason this distinction is important, and the reason I bring
up graph theory, is that liveness is impossible to prove.
Seriously: it's impossible, in the general case, for the GC to
prove that an object is still alive. Whereas it's trivial to
prove reachability.
Now, it is true that there are some cases where an object is dead
but still reachable. The fact of the matter is that in most such
cases, the object becomes unreachable shortly thereafter. In the
cases when it's not, it tends to be impractical to prove an
object is dead. The extra work it would take to prove deadness
in such cases, if it were even possible, would make the
optimization not worthwhile.
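A common shape of reachable-but-dead data is a cache entry that
is never read again. A small Python sketch (the cache and
`handle` function are hypothetical):

```python
import gc

cache = {}   # a global, so it's part of the root set

def handle(request_id):
    # 'data' is a per-request result we stash in the cache.
    data = list(range(100_000))
    cache[request_id] = data   # 'data' is now reachable via 'cache'
    return len(data)

handle("r1")

# Suppose the program never reads cache["r1"] again: the list is
# *dead*. But it is still *reachable* from the root set (through
# the global 'cache'), so no collector may free it.
gc.collect()
assert "r1" in cache   # still in memory: reachable, though dead
```

Only the programmer knows the entry is dead, which is why such
leaks are fixed by design (e.g. evicting or weakly referencing
cache entries) rather than by the GC.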
> And when I have tested all runtime cases of my compiled
> software, which runs slow, but quite deterministically, I will
> go on and build the release build.
>
> And if the release build (which is faster) does not behave
> deterministically, I would fix the "VM/Non-VM compiler" I'm
> talking about until the release build shows the same behavior.
>
> I guess there is a way this approach could fail: Timing may
> have influence and make the VM behave differently from the
> Non-VM (e.g. x64).
I don't know why you're so hung up on timing. It's easy to write
code which isn't sensitive to timing, as long as you don't use
threads. But that still doesn't mean it's possible to test it
exhaustively; see the note about testing above.
More information about the Digitalmars-d-learn
mailing list