Is garbage detection a thing?

Elronnd elronnd at elronnd.net
Sun Nov 29 21:08:01 UTC 2020


On Sunday, 29 November 2020 at 16:05:04 UTC, Mark wrote:
> I have no good understanding why "garbage collection" is a big 
> thing and why "garbage detection" is no thing (I think so).

Because garbage detection is just as expensive as automatic 
garbage collection: the hard part is figuring out which objects 
are garbage in the first place.  So if you're going to go to the 
work of detecting that something is garbage, it's basically free 
to collect it at that point.


> today there exist tools for C++ (allocator APIs for debugging 
> purposes, or Address-Sanitizer or maybe also MPX) to just 
> detect that your program tried to use a memory address that was 
> actually freed and invalidated,

Note that address sanitizer is significantly slower than most 
‘real’ GCs (such as those used by Java and other languages).
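
For reference, here's roughly the kind of thing it catches--a 
minimal sketch in D, assuming LDC's -fsanitize=address support 
(the file name is made up):

    // use_after_free.d -- a deliberate use-after-free for the
    // sanitizer to catch.  Compile with something like:
    //   ldc2 -g -fsanitize=address use_after_free.d
    import core.stdc.stdlib : malloc, free;

    void main()
    {
        int* p = cast(int*) malloc(int.sizeof);
        *p = 42;
        free(p);
        // reported as heap-use-after-free--but only because this
        // particular run happens to execute the bad access
        int x = *p;
    }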


> why did Java and other languages not stop there but also made a 
> system that keeps every address alive as long as it is used?

> Then the debug build would use a virtual machine that uses type 
> information from compilation for garbage detection, but not 
> garbage collection.

Address sanitizer does exactly what you propose here.  The 
problem is this:

Testing can only prove the presence of bugs, never their 
absence.  You may run your C++ program a thousand times with 
address sanitizer enabled and get no errors; yet your code may 
still be incorrect and contain memory errors.  Safety features in 
a language--like a GC--prevent an entire class of bugs 
definitively.
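
To make that concrete: with a GC, memory stays alive for as long 
as any reference to it is reachable, so a dangling read like the 
one in the sketch above simply cannot happen.  An illustrative 
sketch:

    // With a GC there is no `free` to get wrong: the int lives
    // for as long as anything can still point at it.
    int* escape()
    {
        int* p = new int;  // GC-allocated
        *p = 42;
        return p;          // fine: the GC keeps the int alive
    }

    void main()
    {
        import std.stdio : writeln;
        writeln(*escape());  // always 42; never a dangling read
    }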


> One very minor criticism that I have is: With GC there can be 
> "semantically old data" (a problematic term, sorry) which is 
> still alive and valid, and the language gives me the feeling 
> that it is a nice system that way. But the overall behavior 
> isn't necessarily very correct, it's just that it is much 
> better than a corrupted heap which could lead to everything 
> possibly crashing soon.

The distinction here is _reachability_ vs _liveness_.

So, GC theory:

A _graph_ is a type of data structure.  Imagine you have a sheet of 
paper, and on the sheet of paper you have a bunch of dots.  There 
are lines connecting some of the dots.  In graph theory, the dots 
are called nodes, and the lines are edges.  We say that nodes A 
and B are _connected_ if there is an edge going between them.  We 
also say that A is _reachable_ from B if either A and B are 
connected, or A is connected to some C, where C is reachable from 
B.  Basically, if you can reach one point from another just by 
following lines, then each is reachable from the other.

A _directed_ graph is one in which the edges have directionality. 
Imagine the lines have little arrows at the ends.  There may be 
an edge that goes A -> B; or there may be an edge that goes B -> 
A.  Or there may be both: A <-> B.  (Or they can be unconnected.) 
In this case, to reach one node from another, you have to follow 
the arrows.  So it may be that, starting at A, you can reach B; 
but you can't go the other way round.
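
Reachability in a directed graph is easy to compute: start at a 
node and keep following arrows until you either hit the target or 
run out of new nodes.  A small sketch in D, using an adjacency 
list (edges[n] lists the nodes that n has an arrow pointing to):

    bool reachable(int[][] edges, int start, int target)
    {
        auto seen = new bool[edges.length];
        int[] stack = [start];
        while (stack.length > 0)
        {
            int node = stack[$ - 1];
            stack.length -= 1;
            if (node == target) return true;
            if (seen[node]) continue;
            seen[node] = true;
            stack ~= edges[node];  // follow every outgoing arrow
        }
        return false;
    }

    unittest
    {
        // 0 -> 1 -> 2 -> 0, and node 3 has no edges at all
        int[][] g = [[1], [2], [0], []];
        assert(reachable(g, 0, 2));   // 0 -> 1 -> 2
        assert(!reachable(g, 2, 3));  // no arrow ever leads to 3
    }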

The _heap_ is all the objects you've ever created.  This includes 
the objects you allocate with 'new', as well as all the objects 
you allocate from the stack and all your global variables.  
What's interesting is that we can think of the heap as a directed 
graph.  If object A contains a pointer to object B, we can think 
of that the same way as there being an edge going from node A 
to node B.

The _root set_ is some relatively small number of heap objects 
that are always available.  Generally, this is all the global 
variables and stack-allocated objects.  The name _reachable_ is 
given to any object which is reachable from one of the root set.

It is impossible for your program to access an unreachable 
object; there's no way to get a pointer to it in the first place. 
So it is safe for the GC to free unreachable objects.
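
This is exactly what the ‘mark’ half of a mark-and-sweep 
collector does.  A toy sketch in D (not how any production GC is 
actually implemented, but the idea):

    // Every Obj is a node; its refs array is its outgoing edges.
    struct Obj
    {
        bool marked;
        Obj*[] refs;  // pointers this object holds to others
    }

    void mark(Obj* obj)
    {
        if (obj is null || obj.marked) return;
        obj.marked = true;
        foreach (child; obj.refs)
            mark(child);  // follow the arrows
    }

    // Mark everything reachable from the root set; whatever is
    // left unmarked can never be touched again, so it's garbage.
    Obj*[] findGarbage(Obj*[] roots, Obj*[] heap)
    {
        foreach (root; roots)
            mark(root);
        Obj*[] garbage;
        foreach (obj; heap)
            if (!obj.marked)
                garbage ~= obj;
        return garbage;
    }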

But we can also add another category of objects: _live_ vs _dead_ 
objects.  Live objects are ones which you're actually going to 
access at some point.  Dead objects are objects that you're never 
going to access again, even if they're reachable.  If a GC could 
detect which reachable objects were dead, it could be more 
efficient and use less memory...hypothetically.

The reason this distinction is important, and the reason I bring 
up graph theory, is that liveness is impossible to prove.  
Seriously: deciding whether a program will ever access an object 
again is, in the general case, equivalent to the halting problem, 
so it's impossible for the GC to prove that an object is still 
live.  Whereas reachability is just a graph traversal, and 
trivial to prove.

Now, it is true that there are some cases where an object is dead 
but still reachable.  The fact of the matter is that in most such 
cases, the object becomes unreachable shortly thereafter.  In the 
cases where it doesn't, it tends to be impractical to prove the 
object is dead.  The extra work it would take to prove deadness 
in such cases, if it were even possible, would make the 
optimization not worthwhile.
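
For example: in the sketch below, cache[0] is dead--the program 
never reads it again--but it is still reachable, so the GC has to 
keep it.  Clearing the reference severs the edge in the graph and 
lets the GC reclaim the memory:

    void main()
    {
        ubyte[][] cache;
        cache ~= new ubyte[](64 * 1024 * 1024);  // big allocation

        // ... lots of work that never touches cache[0] again ...

        cache[0] = null;  // now unreachable, hence collectible
    }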


> And when I have tested all runtime cases of my compiled 
> software, which runs slow, but quite deterministically, I will 
> go on and build the release build.
>
> And if the release build (which is faster) does not behave 
> deterministically, I would fix the "VM/Non-VM compiler" I'm 
> talking about until the release build shows the same behavior.
>
> I guess there is a way this approach could fail: Timing may 
> have influence and make the VM behave differently from the 
> Non-VM (e.g. x64).

I don't know why you're so hung up on timing.  It's easy to write 
code which isn't sensitive to timing, as long as you don't use 
threads.  But even deterministic code can't be tested 
exhaustively; see the note about testing above.

