The D standard library is built on GC, is that a negative or positive?

H. S. Teoh hsteoh at qfbox.info
Wed Dec 14 01:18:59 UTC 2022


On Tue, Dec 13, 2022 at 07:11:34AM +0000, thebluepandabear via Digitalmars-d wrote:
> Hello,
> 
> I was speaking to one of my friends on D language and he spoke about
> how he doesn't like D language due to the fact that its standard
> library is built on top of GC (garbage collection).
> 
> He said that if he doesn't want to implement GC he misses out on the
> standard library, which for him is a big disadvantage.
> 
> Does this claim have merit? I am not far enough into learning D, so I
> haven't touched GC stuff yet, but I am curious what the D community
> has to say about this issue.

1) No, this claim has no merit.  However, I sympathize with the reaction
because that's the reaction I myself had when I first found D online. I
came from a strong C/C++ background, got fed up with C++ and was looking
for a new language closer to my ideals of what a programming language
should be.  Stumbled across D, which caught my interest.  Then I saw the
word "GC" and my knee-jerk reaction was, "what a pity, the rest of the
language looks so promising, but GC? No thanks."  It took me a while to
realize the flaw in my reasoning.  Today, I wholeheartedly embrace the
GC.

2) Your friend has incomplete/inaccurate information about the standard
library being dependent on the GC.  A pretty significant chunk of Phobos
is actually usable without the GC -- a large part of the range-based
stuff (std.range, std.algorithm, etc.), for example.  True, some parts
are GC-dependent, but you can still get pretty good mileage out of the
@nogc-compatible subset of Phobos.
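
As a small illustration (the function and numbers below are my own
made-up example, not taken from the Phobos docs), lazy range pipelines
from std.range/std.algorithm can be composed inside a @nogc function,
and the compiler verifies that none of it touches the GC:

    // Sketch: lazy ranges allocate nothing, so this compiles as @nogc.
    import std.algorithm : filter, map, sum;
    import std.range : iota;

    @nogc int sumOfSquaredEvens(int n)
    {
        return iota(n)                  // 0, 1, ..., n-1 (lazy, no allocation)
            .filter!(x => x % 2 == 0)   // keep the even numbers
            .map!(x => x * x)           // square them
            .sum;                       // eager reduction, still no allocation
    }

    void main()
    {
        assert(sumOfSquaredEvens(10) == 120);  // 0 + 4 + 16 + 36 + 64
    }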

//

The thing about GC vs. non-GC is that, coming from a C/C++ background,
my philosophy was that I must be in control of every detail of my
program; I had to know exactly what it does at any given point.
Especially when it comes to managing memory allocations. The idea being
that if I kept my memory tidy (i.e., free allocated chunks when I'm done
with them) then there wouldn't be an accumulation of garbage that would
cost a lot of time to clean up later. The idea of a big black box called
the GC that I don't understand, randomly taking over management of my
memory, scared me.  What if it triggered a collection at an inconvenient
time when performance is critical?

Not an entirely wrong line of reasoning, but manual memory management
comes with costs:

a) The biggest cost is the additional mental load it adds to your
programming tasks.  Once you go beyond your trivial hello-world and
add-two-numbers-together type of functions, you have to start thinking
about memory management at every turn, every juncture. "My function
needs space to sort this list of stuff, hmm, I need to allocate a
buffer. How big of a buffer do I need?  When should I allocate it? When
should I free it?  I also need this other scratchpad buffer for caching
this other bit of data that I'll need 2 blocks down the function body.
Better allocate it too.  Oh no, now I have to free it, so both branches
of the if-statement have to check the pointer and free it.  Oh, and
inside this loop too; I can't just short-circuit it by returning from
the function, I need an exit block for cleaning up my allocations.  Oh,
but this function might be called from a performance-critical part of
the code!  Better not do allocations here, let the caller pass it in.
Oh wait, that changes the signature of this function, so I can't put it
in the generic table of function pointers to callbacks anymore, I need a
control block to store the necessary information.  Oh wait, I have to
allocate the control block too. Who's gonna free it? When should it be
freed?"

And on and on it goes.  Pretty soon, you find yourself spending an
inordinate amount of time and effort fiddling with memory management
rather than making progress in the problem domain, i.e., actually
solving the problem you set out to solve in the first place.
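
For contrast, here's a tiny hypothetical D sketch (the helper name is
mine) of the same kind of task under the GC: a scratch buffer is simply
allocated and forgotten -- no exit blocks, no ownership to thread
through the API:

    import std.algorithm : sort;

    // Hypothetical helper: with the GC, the scratch copy needs no
    // matching free -- allocate it, use it, return it; the collector
    // reclaims it once nothing refers to it any more.
    int[] sortedCopy(const(int)[] input)
    {
        auto buf = input.dup;   // GC allocation, no ownership to track
        buf.sort();
        return buf;             // safe to return; no cleanup duty left behind
    }

    void main()
    {
        assert(sortedCopy([3, 1, 2]) == [1, 2, 3]);
    }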

And worse yet:

b) Your APIs become cluttered with memory management paraphernalia.
Instead of taking only the input parameters directly related to the
problem domain the function is supposed to work in, it must also take
memory-management related stuff: allocators, wrapped pointers --
because nobody can keep track of raw pointers without eventually
tripping up, you'd better wrap them in a managed pointer like
auto_ptr<> or some ref-counted handle.  But should you use auto_ptr<>
or ref_counted<> or something else?  In a large project, some
functions will expect auto_ptr<>, others will expect ref_counted<>,
and when you
need to put them together, you need to insert additional code for
interconverting between your wrapped pointer types. (And you need to
take extra care not to screw up the semantics and leak/corrupt memory.)

The net result is, memory management paraphernalia percolates
throughout your code, polluting every API and demanding extra code for
interconverting / gluing disparate memory management conventions
together.  Extra code that doesn't help you make any progress in your
problem domain, but has to be there because of manual memory
management.
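
To put point (b) in D terms, here's a rough hypothetical sketch (the
Node type and function names are made up for illustration): once an
ownership wrapper like std.typecons.RefCounted enters the picture, it
becomes part of every signature that touches the data, whereas the
GC-backed version mentions only the problem-domain type:

    import std.typecons : RefCounted;

    struct Node { int value; }

    // Manual-ownership style: the wrapper type leaks into the API.
    int readValue(RefCounted!Node node) { return node.value; }

    // GC style: the signature mentions only the problem-domain type.
    int readValue(Node* node) { return node.value; }

    void main()
    {
        auto rc = RefCounted!Node(42);   // ref-counted handle
        auto gc = new Node(7);           // plain GC-managed node
        assert(readValue(rc) == 42 && readValue(gc) == 7);
    }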

c) So you went through all of the above troubles because you believed
that it would save you from the bogeyman of unwanted GC pauses and keep
you in control of the inner workings of your program.  But does it
really live up to its promises?  Not necessarily.

If you have a graph of allocated objects, for example, then when the
last reference to that graph goes out of scope, you have to deallocate
the entire graph.  The dtor must recursively traverse the whole
structure and destroy everything, because after that point you no
longer have a reference to the graph and would leak the memory if you
didn't clean up right now.  And here's the thing: in a sufficiently
complex program, (1) you cannot predict the size of this graph -- it's
potentially unbounded; and (2) you cannot predict where in the code the
last reference will go out of scope (when the refcount goes to 0, if
you're using refcounting).  The net result is: your program will
unpredictably get to a point where it must spend an unbounded amount of
time to deallocate a large graph of allocated objects.

IOW, this is not that much different from the GC having to pause and do
a collection at an unpredictable time.

So you put in all this effort just to avoid this bogeyman, and lo and
behold you haven't got rid of it at all!

Furthermore, on today's CPU architectures that have cache hierarchies
and memory access prediction units, one very important factor of
performance is locality: if your program accesses memory sequentially,
or in locations close to each other, it tends to run faster than if it
has to jump between scattered locations in memory.  If you manage
memory yourself,
then when a large graph of objects is going out of scope you're forced
to clean it up right there and then -- even if the nodes happen to be
widely scattered across memory (because they were allocated at different
times in the program and attached to the graph).  If you used a GC,
however, the GC can order its scan for garbage in a way that makes
better use of the cache -- because the GC isn't obligated to clean up
immediately, it can wait until there's enough garbage that a single
sweep picks up pieces of diverse object graphs that happen to be close
to each other in memory, and clean them up in a sequential order so
that there are fewer CPU cache misses.

Or, to put it succinctly, the GC can sometimes outperform your manual
management of memory!

d) Lastly, memory management is hard.  Very hard.  So hard that, after
how many decades of industry experience with manual memory management in
C/C++, well-known, battle-worn large software projects are still riddled
with memory management bugs that lead to crashes and security exploits.
Just check the CVE database, for example.  An inordinately large
proportion of security bugs are related to memory management.

Using a GC immediately gets rid of 90% of these issues. (Not 100%,
unfortunately, because there are still cases where problems may arise.
See: "memory management is hard".)  If you don't need to write the code
that frees memory, then by definition you cannot introduce bugs while
doing so.


This leads us to the advantages of having a GC:

1) It greatly reduces the number of memory-related bugs in your program.
Gets rid of an entire class of bugs related to manually managing your
allocations.

2) It frees up your mental resources to make progress in your problem
domain, instead of endlessly worrying about the nitty-gritty of memory
management at every turn.  More mental resources available means you can
make progress in your problem domain faster, and with lower chances of
bugs.

3) Your APIs become cleaner.  You no longer need memory management
paraphernalia polluting your APIs; your parameters can be restricted to
only those that are required for your problem domain and nothing else.
Cleaner APIs lead to less boilerplate / glue code for interfacing
between APIs that expect different memory management schemes (e.g.,
converting between auto_ptr<> and ref_counted<> or whatever). Diverse
modules become more compatible with each other, and can call each other
with less friction.  Less friction means shorter development times,
fewer bugs, and better maintainability (code without memory management
paraphernalia is much easier to read -- and to understand correctly, so
that you can make modifications without introducing bugs).

4) In some cases, you may even get better runtime performance than if
you manually managed everything.

//

And as a little footnote: D's GC does not run in the background
independently of your program's threads; GC collections will NOT trigger
unless you're allocating memory and the GC runs out of memory to give
you.  Meaning that you *do* have some control over GC pauses in your
program -- if you want to be sure you have no collections in some piece
of code, simply don't do any allocations, and collections won't start.

If you're worried that another thread might trigger a collection, you
can always bring out the GC.disable() hammer to suppress collections
even in the face of continuing allocations (barring an out-of-memory
situation).  Then call GC.enable() later, when it's safe for
collections to run again.

And if you're like me, and you like more control over how things are
run in your program, you can even call GC.disable() and then
periodically call GC.collect() on your own schedule, at your own
convenience.  (In one of my D projects, I managed to eke out a 20-25%
performance boost just by reducing the frequency of GC collections:
running GC.collect() on my own schedule.)
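
A minimal sketch of that pattern (the per-frame structure and the
numbers are mine, not from the project mentioned above):

    import core.memory : GC;

    void main()
    {
        GC.disable();                 // suppress automatic collections
        scope (exit) GC.enable();     // restore normal behavior on the way out

        foreach (frame; 0 .. 1_000)
        {
            doFrameWork();            // allocates freely; no collection here

            if (frame % 100 == 99)
                GC.collect();         // collect only at points we chose
        }
    }

    // Hypothetical per-frame workload that makes GC allocations.
    void doFrameWork()
    {
        auto scratch = new int[](256);
        scratch[] = 42;
    }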

//

Also, in those few places in your code where the GC really *does* get in
your way, there's @nogc at your disposal.  The compiler will statically
enforce zero GC usage in such functions, so that you can be sure you
won't trigger any collections and you won't make any new GC allocations.
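
For instance (a trivial made-up function, just to show the
enforcement):

    // @nogc is checked at compile time: uncommenting the `new` line
    // below produces a compile error, because allocating with `new` is
    // not allowed inside a @nogc function.
    @nogc void doubleAll(int[] buf)
    {
        foreach (ref x; buf)
            x *= 2;                           // fine: no GC allocation

        // auto tmp = new int[](buf.length); // compile error inside @nogc
    }

    void main()
    {
        auto data = [1, 2, 3];                // main is not @nogc, so this is fine
        doubleAll(data);
        assert(data == [2, 4, 6]);
    }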

//

So you see, the GC isn't really *that* bad, as if it were a plague that
you have to avoid at all costs.  It's actually a good helper if you know
how to make use of its advantages.


T

-- 
Why waste time reinventing the wheel, when you could be reinventing the engine? -- Damian Conway
