How to save RAM in D programs (on zero-initialized buffers)

Marco Leise Marco.Leise at gmx.de
Tue Feb 7 11:35:22 PST 2012


Hi, this is me again with some "size matters" topic. This time, it's not  
the executable size, no! Instead I want to discuss a runtime memory  
footprint and speed issue that affects everyone, and how to improve the  
situation dramatically.

In D we allocate memory through the GC, and it is initialized according to  
the type's .init, which gives us a safe default. In most cases this results  
in the memory block being zeroed out, as in the case of allocating ubyte[]  
buffers. Let's assume we have a program that allocates some buffers in  
advance that it may not use fully. This often happens when the input data  
is much smaller than anticipated. So our memory manager should handle this  
situation well:
   o  zero out a memory block
   o  we probably don't need all of it

So here is a small benchmark that allocates 512 * 1 MB, first using the  
typical method: new ubyte[1024 * 1024]. The output is:

	** new ubyte[1024 * 1024]
	   resource usage: +526840 KB
	   user time: +0.098s | sys. time: +0.368s
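
Since the attached program is what produced these numbers, here is only a  
rough sketch of the first test case, not the attachment itself. I measure  
the peak RSS delta via getrusage here, which is just one way to  
approximate the "resource usage" lines above:

	import core.sys.posix.sys.resource;
	import std.stdio : writefln;

	void main()
	{
	    rusage before, after;
	    getrusage(RUSAGE_SELF, &before);

	    ubyte[][] keep;                        // hold references so the GC can't collect them
	    foreach (i; 0 .. 512)
	        keep ~= new ubyte[1024 * 1024];    // the "typical method"

	    getrusage(RUSAGE_SELF, &after);
	    // ru_maxrss is reported in KB on Linux
	    writefln("resource usage: +%s KB", after.ru_maxrss - before.ru_maxrss);
	}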

As expected, physical memory usage increases by ~512 MB, and we spend a  
considerable amount of time in the kernel finding free memory blocks and  
in our own program initializing the data to zero. Can we use the GC more  
directly? Let's try GC.calloc:

	** GC.calloc(1024 * 1024)
	   resource usage: +525104 KB
	   user time: +0.089s | sys. time: +0.370s
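
For completeness, the GC.calloc variant of each allocation is essentially  
this (my sketch; GC.calloc hands back a raw void* that I slice myself):

	import core.memory : GC;

	// GC.calloc returns zero-initialized memory from the GC heap
	auto p = cast(ubyte*) GC.calloc(1024 * 1024);
	auto buf = p[0 .. 1024 * 1024];    // wrap the raw pointer in a slice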

Again, 512 MB and about the same time. Nothing gained, but my RAM is  
starting to fill up. By the way, how does a call to good old C 'malloc'  
compare? That gives us a block of garbage 'initialized' data - a  
situation we left behind for good in D! So here we go with another test:

	** malloc(1024 * 1024)
	   resource usage: +2048 KB
	   user time: +0.000s | sys. time: +0.002s
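
In D this test is just the C function imported from core.stdc.stdlib,  
roughly like this sketch:

	import core.stdc.stdlib : malloc;

	// malloc only hands out address space here; the pages are not
	// committed yet and the contents are not guaranteed to be anything
	auto p = cast(ubyte*) malloc(1024 * 1024);
	auto buf = p[0 .. 1024 * 1024];    // uninitialized!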

Oh nice! May I say... these 512 calls were practically free? 2 MB and  
0.002 seconds ain't worth talking about. The operating system didn't  
actually allocate the memory, it merely gave us a range of virtual  
addresses to use. Only when we write to the memory does physical memory  
get committed. That's perfect for a generously sized buffer, right?  
Well... we still want it zeroed out, so let's initialize this data to  
zero with ptr[0 .. 1024 * 1024] = 0:

	** malloc(1024 * 1024) + zero out
	   resource usage: +526336 KB
	   user time: +0.053s | sys. time: +0.366s

... and we are back at square one, except that the user time is notably  
lower. What we need is a facility that gives us lazily allocated,  
zeroed-out memory. And guess what, it's not too much to ask for. Here is  
'calloc' to the rescue:

	** calloc(1, 1024 * 1024)
	   resource usage: +2048 KB
	   user time: +0.001s | sys. time: +0.001s
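
The D side of this test is again just the C function, along these lines  
(my sketch):

	import core.stdc.stdlib : calloc;

	// calloc(nmemb, size): zeroed memory, but the kernel can back it with
	// copy-on-write mappings of a shared zero page until we write to it
	auto p = cast(ubyte*) calloc(1, 1024 * 1024);
	auto buf = p[0 .. 1024 * 1024];    // reads as all zeroes, no memset needed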

How does it work? The operating system fakes the memory allocation and  
merely maps our pages (131072 of them for 512 MB at 4 KiB each) onto a  
single special read-only page of zeroes. The semantics are copy-on-write:  
we start with a view of zeroed-out memory and only get real, private pages  
once we write into them. (Sorry if I'm telling some of you nothing new,  
but I just found this out today ;) )
The question I have is: should we go and improve druntime with this  
knowledge? I'm not aware of any caveats. Are there any?
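
To make the idea a bit more concrete: a zero-initialized allocation could  
in principle be routed through calloc instead of allocate-then-memset. The  
helper below is purely hypothetical, to illustrate the point; it is not an  
existing or proposed druntime API:

	import core.stdc.stdlib : calloc;
	import core.exception : onOutOfMemoryError;

	// hypothetical: hand out a zeroed buffer that relies on the kernel's
	// copy-on-write zero pages instead of eagerly writing zeroes ourselves
	ubyte[] lazilyZeroed(size_t size)
	{
	    auto p = cast(ubyte*) calloc(1, size);
	    if (p is null)
	        onOutOfMemoryError();
	    return p[0 .. size];
	}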

Thanks for reading. The test program for Linux is in the attachment (I  
used GDC to compile it).

-- Marco
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memory_benchmark.d
Type: application/octet-stream
Size: 2405 bytes
Desc: not available
URL: <http://lists.puremagic.com/pipermail/digitalmars-d/attachments/20120207/6b2ab271/attachment.obj>

