[D-runtime] Precise garbage collection

Rainer Schuetze r.sagitario at gmx.de
Mon Jul 1 01:00:30 PDT 2013


On 28.06.2013 23:28, Rainer Schuetze wrote:
> There is still the problem of manually arranged data to be solved. E.g.
> the associative array combines node-list entry, key and value into a
> single allocation, and only the individual type infos are available (if
> at all). Same problem as above: how do we create a combined type info
> at runtime?
>
> The current implementation "emplaces" the pointer bits at the
> corresponding addresses, but this needs two additional calls into the
> GC, resulting in far from optimal performance.

I have added code to the AA implementation to create the combined 
TypeInfo at runtime (storing it inside the Impl struct). This removes 
the overhead of the additional calls to gc_emplace.
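
To illustrate the idea, here is a minimal sketch of merging per-type
pointer bitmaps into one bitmap covering the whole entry. It is not the
actual patch: the patch builds the combined TypeInfo at runtime from the
key/value TypeInfo and caches it in Impl, while the sketch uses
compile-time types via __traits(getPointerBitmap, ...), and the entry
layout and offsets are assumptions.

import std.stdio : writeln;

// Build one pointer bitmap (1 bit per word, set = "may hold a pointer")
// for an entry assumed to be laid out as
//     { Entry* next; size_t hash; Key key; Value value; }
// __traits(getPointerBitmap, T) yields [T.sizeof, bitmap words...].
size_t[] combinedEntryBitmap(Key, Value)()
{
    static immutable keyInfo = __traits(getPointerBitmap, Key);
    static immutable valInfo = __traits(getPointerBitmap, Value);

    enum wordSize = size_t.sizeof;
    enum bitsPerWord = 8 * wordSize;
    static size_t roundUp(size_t n) { return (n + wordSize - 1) & ~(wordSize - 1); }

    enum keyOffset = 2 * wordSize;                  // after next + hash (assumed)
    immutable valOffset = keyOffset + roundUp(Key.sizeof);
    immutable entrySize = valOffset + roundUp(Value.sizeof);

    auto bitmap = new size_t[(entrySize / wordSize + bitsPerWord - 1) / bitsPerWord];

    void setWord(size_t byteOffset)
    {
        immutable bit = byteOffset / wordSize;
        bitmap[bit / bitsPerWord] |= size_t(1) << (bit % bitsPerWord);
    }

    // copy the pointer bits of a member type to its offset in the entry
    void merge(const(size_t)[] info, size_t typeSize, size_t baseOffset)
    {
        foreach (w; 0 .. typeSize / wordSize)
            if (info[1 + w / bitsPerWord] & (size_t(1) << (w % bitsPerWord)))
                setWord(baseOffset + w * wordSize);
    }

    setWord(0);                                     // Entry.next is a pointer
    merge(keyInfo, Key.sizeof, keyOffset);
    merge(valInfo, Value.sizeof, valOffset);
    return bitmap;
}

void main()
{
    writeln(combinedEntryBitmap!(string, int[])()); // next, string.ptr, array.ptr
    writeln(combinedEntryBitmap!(uint, uint)());    // only the next pointer
}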

Using a slightly modified testgc3 from the dmd test suite, I ran some 
benchmarks. It creates 10000 AAs of type uint[uint] and adds 1000 
entries to each. This is close to a worst case for the GC because a lot 
of allocations happen, but nothing can be collected until the very end.
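
The loop is roughly this (a sketch of the modified testgc3 with assumed
timing/output code, not the actual test source):

import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    auto aas = new uint[uint][](10_000);
    foreach (ref aa; aas)               // keep every AA alive until the end
        foreach (uint k; 0 .. 1_000)
            aa[k] = k;                  // many small allocations, none collectable
    writefln("filled %s AAs in %s", aas.length, sw.peek);
}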

After some micro-optimizations in the GC, I managed to get the precise 
GC (p_on) on par with or better than the existing GC (clean):

Desktop Core2Duo 6400 @ 2.13 GHz (Win32)

clean: peak mem = 182 MB,  17 collections in 10.816 s, overall 14.036 s
p_off: peak mem = 183 MB,  17 collections in 10.984 s, overall 14.327 s
p_on:  peak mem = 188 MB,  17 collections in  8.304 s, overall 11.940 s

p_off is the new implementation with precise collection switched off. I 
believe its small overhead comes from most GC functions now taking the 
type info as an additional parameter.
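
To show what I mean by the extra parameter (using today's core.memory
API for illustration; the exact signatures in my branch may differ):

import core.memory : GC;

struct Node { Node* next; int payload; }

void main()
{
    // With the TypeInfo passed along, the precise GC can look up the
    // type's pointer bitmap and scan only the words that may hold
    // pointers; p_off still pays for passing the extra argument.
    auto p = cast(Node*) GC.malloc(Node.sizeof, 0, typeid(Node));
    p.payload = 42;
}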

On my laptop, measurements are a bit less accurate because the mobile 
CPU switches power states, but with Turbo Boost disabled I get mostly 
reproducible results (best of 3 runs):

Mobile i7 2670QM @ 2.2 GHz (Win32)

clean: peak mem = 184 MB,  17 collections in  7.292 s, overall  9.648 s
p_off: peak mem = 185 MB,  17 collections in  7.734 s, overall 10.133 s
p_on:  peak mem = 190 MB,  17 collections in  6.743 s, overall  9.326 s

Mobile i7 2670QM @ 2.2 GHz (Win64)

clean: peak mem = 680 MB,  40 collections in 27.662 s, overall 30.643 s
p_off: peak mem = 680 MB,  40 collections in 29.731 s, overall 32.721 s
p_on:  peak mem = 691 MB,  40 collections in 27.793 s, overall 31.226 s

The 64-bit version needs almost 4 times as much memory because of 
conservative assumptions about the alignment of the value in the 
(key,value) pair (unfortunately, the value type is not available in 
most functions). I'm not sure why the p_off version spends considerably 
more time collecting than the clean version; my guess is that some 
refactorings reduced the amount of inlining.
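
As a back-of-the-envelope illustration (assumed layout and alignment
numbers, not the exact aaA.d code): rounding the key and value slots up
to a worst-case alignment roughly doubles the uint[uint] entry size on
64 bit, on top of the doubled pointer and hash fields.

import std.stdio : writeln;

// entry assumed to be a header (next pointer + hash) followed by key
// and value, each rounded up to "worstAlign" bytes
size_t entrySize(size_t keySize, size_t valueSize, size_t worstAlign)
{
    size_t roundUp(size_t n) { return (n + worstAlign - 1) & ~(worstAlign - 1); }
    enum header = 2 * size_t.sizeof;
    return header + roundUp(keySize) + roundUp(valueSize);
}

void main()
{
    writeln(entrySize(4, 4, 16)); // conservative alignment: 48 bytes per entry
    writeln(entrySize(4, 4, 4));  // with the value type known: 24 bytes
}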

When running the test in a loop, the GC should kick in, but the clean 
Win32 build does not actually collect much (presumably due to false 
pointers), while the clean Win64 build reclaims most of it. For 4 
iterations, I get

win32_clean:  mem = 724 MB,  41 gcs in 99.473 s, overall 108.802 s
win32_p_on:   mem = 199 MB,  20 gcs in  7.329 s, overall 17.417 s

win64_clean:  mem = 698 MB,  43 gcs in 28.317 s, overall 39.364 s
win64_p_on:   mem = 708 MB,  43 gcs in 28.273 s, overall 41.142 s



