How to save RAM in D programs (on zero initialized buffers): Reloaded

Tue Feb 7 19:40:50 PST 2012

On Tue, Feb 7, 2012 at 5:39 PM, Marco Leise <Marco.Leise at gmx.de> wrote:
> Hi, this is me again with some "size matters" topic. This time, it's not
> the executable size, no! Instead I want to discuss a runtime memory
> footprint and speed issue that affects everyone, and how to improve the
> situation dramatically.
>
> In D we allocate memory through the GC, that is initialized according to
> the type's .init, which gives us a save default. In most cases this will
> result in the memory block being zeroed out, like in the case of
> allocating ubyte[] buffers. Let's assume, we have a program that
> allocates some buffers in advance, that it may not use fully. This often
> happens
> when the input data is much smaller than the anticipated case. So our
> memory manager should handle this situation well:
>   o  zero out a memory block
>   o  we probably don't need all of it
>
> So here is a small benchmark that allocates 512 * 1 MB, first using the
> typical method: new ubyte[1024 * 1024]. The oputput is:
>
>        ** new ubyte[1024 * 1024]
>           ressource usage: +526840 KB
>           user time: +0.098s | sys. time: +0.368s
>
> As expected we have a physical memory usage increase of ~512 MB and spent
> a considerable amount of time in the system to find free memory blocks
> and in our program to initialize the data to zero. Can we use the GC more
> directly? Let's try GC.calloc:
>
>        ** GC.calloc(1024 * 1024)
>           ressource usage: +525104 KB
>           user time: +0.089s | sys. time: +0.370s
>
> Again, 512 MB and about the same time. Nothing gained, but my RAM is
> starting to fill up. By the way, how does a good old system call to
> 'malloc' compare? That gives us a block of garbage 'initialized' data - a
> situation we left behind for good in D! So here we go with another test:
>
>        ** malloc(1024 * 1024)
>           ressource usage: +2048 KB
>           user time: +0.000s | sys. time: +0.002s
>
> Oh nice! May I say... these 512 calls were for free? 2 MB and 0.002
> seconds ain't worth talking about. The operating system didn't actually
> allocate the memory, it merely gave us a virtual memory range to use.
> Only when we write to the memory will physical memory be bound. That's
> perfect
> for a generously sized buffer, right? Well... we still want it zeroed
> out, so let's initialize this data to zero with ptr[0 .. 1024 * 1024] = 0:
>
>        ** malloc(1024 * 1024) + zero out
>           ressource usage: +526336 KB
>           user time: +0.053s | sys. time: +0.366s
>
> ... and we are back at square one. With the exception, that the user time
> is notably lower. What we need is a facility that gives us lazily
> allocated zeroed out memory. And guess what, it's not too much to ask
> for. Here is 'calloc' to the rescue:
>
>        ** calloc(1, 1024 * 1024)
>           ressource usage: +2048 KB
>           user time: +0.001s | sys. time: +0.001s
>
> How does it work? The operating system fakes the memory allocation and
> just gives us 131072 references to a special read-only memory page of
> zeroes. The semantic is copy-on-write. So we start with a view on zeroed
> out memory and get the real thing once we write into it.

Special? What do you mean by special? Most OS use Virtual Memory so
sure they can say here is a page and yet not have that page backed by
physical memory. To my knowledge, they can only do this if the
allocated memory points to an unmapped page. I doubt this is the case
in non-trivial programs.

> (Sorry, if I
> tell
> some of you nothing new, but I just found this out today ;) )
> The question I have is, should we go and improve druntime with that
> knowledge? I'm not aware of any caveats, are there any?
> Thanks for reading and the test program for Linux is in the attachment (I
> used GDC to compile).
>
> -- Marco