Concurrent GC (for Windows)

Fri Jun 13 01:22:23 PDT 2014

On 13.06.2014 02:38, Dmitry Olshansky wrote:
> 12-Jun-2014 10:34, Rainer Schuetze wrote:
>>
>> I implemented the QueryWorkingSetEx version like this (you need a
>> converted psapi.lib for Win32):
>
> Yes, exactly, but I forgot the recipe to convert COFF/OMF import libraries.

Grab coffimplib.exe from Digital Mars; it converts a COFF import library to OMF.

>> This function
>> is not supported on XP, though.
>
> I wouldn't worry about it, it's not like XP users are growing in
> numbers. Also it looks like only the 64-bit version is good to go, as on
> 32-bit it would cut usable memory in half.

There could also be a fallback to VirtualQuery if QueryWorkingSetEx
isn't available.
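
A minimal sketch of that fallback, in Python with ctypes purely for illustration (the GC would do the equivalent lookup in D). The helper names has_query_working_set_ex and choose_probe are made up for this example, and the sketch only decides which API to use; it does not call either one.

```python
import ctypes


def has_query_working_set_ex():
    """Return True if psapi.dll exports QueryWorkingSetEx on this machine."""
    win_dll = getattr(ctypes, "WinDLL", None)  # only defined on Windows
    if win_dll is None:
        return False
    try:
        psapi = win_dll("psapi")
    except OSError:
        return False
    # ctypes resolves exports lazily; a missing symbol raises AttributeError,
    # which hasattr() turns into False.
    return hasattr(psapi, "QueryWorkingSetEx")


def choose_probe(have_qwsex):
    """Pick the page-state query to use (hypothetical helper)."""
    return "QueryWorkingSetEx" if have_qwsex else "VirtualQuery"


probe = choose_probe(has_query_working_set_ex())
```

Doing the lookup once at startup keeps the per-collection path free of any availability checks.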

>> A short benchmark shows that VirtualQuery needs 55/42 ms for your test
>> on Win32/Win64 on my mobile i7, while QueryWorkingSetEx takes about 17
>> ms for both.
>
> Seems in line with my measurements. Strictly speaking, 1/2 of the pages,
> interleaved, should give an estimate of the worst case. Together with
> remapping (freeing duplicated pages) it doesn't go beyond 250 ms on 640 MB
> of heap.
>
>> If I add the actual copy into heap2 (i.e. every fourth page of 512 MB is
>> copied), I get 80-90 ms more.
>
> Aye... this is a lot. Also for me it turns out that unmapping the CoW view
> at the last step takes most of the time.

Maybe the memory actually needs to be flushed to the file once no mapping
exists any more. If that is the case, we could avoid it by creating another
temporary mapping.

> It might help to split the full heap into multiple views.

The current GC uses pools of max 32 MB, so that already exists.

> Also using VirtualProtect during the first step - turning a mapping into
> a CoW one is faster than unmap/map (by a factor of 2).
>
> One thing that may help is saving a pointer to the end of the used heap at
> the moment of the scan, then remapping only this portion as CoW.
>

The pool architecture already does this if scanning just ignores new 
pools during collection.
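
As a rough model of that idea (not the actual druntime code): the collector freezes the list of pools when a collection starts, and anything allocated afterwards lands in a pool that is neither protected nor scanned. The Pool and Heap classes, the base address and the pool sizes are all invented for this sketch, and it assumes objects in pools created during a collection are simply kept alive until the next cycle.

```python
class Pool:
    def __init__(self, base, size):
        self.base = base
        self.size = size


class Heap:
    MAX_POOL_SIZE = 32 * 1024 * 1024  # pool size cap mentioned above

    def __init__(self):
        self.pools = []
        self.next_base = 0x1000_0000  # made-up address for the model

    def add_pool(self, size):
        size = min(size, self.MAX_POOL_SIZE)
        pool = Pool(self.next_base, size)
        self.next_base += size
        self.pools.append(pool)
        return pool

    def begin_collection(self):
        # Freeze the set of pools that get protected (COW) and scanned.
        return list(self.pools)


heap = Heap()
heap.add_pool(32 * 1024 * 1024)
heap.add_pool(16 * 1024 * 1024)
snapshot = heap.begin_collection()
heap.add_pool(8 * 1024 * 1024)  # allocated while the collection runs
cow_bytes = sum(p.size for p in snapshot)
```

Only the 48 MB in the two frozen pools would have to be turned into COW mappings here; the 8 MB pool added later is left alone.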

Another optimization is to segregate the heap into memory with
references and memory with plain data (NO_SCAN) at page/pool
granularity. NO_SCAN pages won't need COW. My hope is that this
considerably reduces the address range that has to be duplicated.
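
To illustrate the effect with invented numbers: if pools are tagged as a whole, only the pools that may contain references have to be protected. The dictionary layout, the page counts and the NO_SCAN value below are assumptions made for this sketch, not measurements or the GC's real data structures.

```python
PAGE_SIZE = 4096
NO_SCAN = 0x2  # stands in for the GC block attribute; value is illustrative


def pages_needing_cow(pools):
    """Count pages that must be write-protected for a snapshot."""
    return sum(p["pages"] for p in pools if not (p["attr"] & NO_SCAN))


# Hypothetical heap: 8 MB of objects with references, 24 MB of plain data.
pools = [
    {"name": "refs", "pages": 2048, "attr": 0},
    {"name": "data", "pages": 6144, "attr": NO_SCAN},
]

cow_pages = pages_needing_cow(pools)
total_pages = sum(p["pages"] for p in pools)
```

In this made-up split only a quarter of the heap would need a COW mapping; how large the real saving is depends entirely on the application's mix of data.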

> The last issue I see is adjustment of pointers - in a GC, the mapped view is
> mapped at a new address, so pointers would need a fixup during scanning.
>

Agreed, that's a slight additional cost for scanning, but I don't think 
it will be too difficult to implement.
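
The fixup itself is just an offset per pool. A sketch under the assumption that each pool keeps both its normal base address and the address where its frozen view is mapped (PoolView, to_snapshot and all addresses here are hypothetical):

```python
class PoolView:
    def __init__(self, base, size, snapshot_base):
        self.base = base                    # address the mutator uses
        self.size = size
        self.snapshot_base = snapshot_base  # where the frozen view is mapped

    def contains(self, addr):
        return self.base <= addr < self.base + self.size


def to_snapshot(addr, views):
    """Translate a heap pointer to the address of its frozen copy.

    Returns None if addr does not point into any snapshotted pool.
    """
    for v in views:
        if v.contains(addr):
            return addr - v.base + v.snapshot_base
    return None


views = [
    PoolView(0x1000_0000, 0x0200_0000, 0x5000_0000),
    PoolView(0x1200_0000, 0x0200_0000, 0x7000_0000),
]
```

Pointers found in the snapshot still hold the original addresses, so the scanner would translate each one before reading through it, while mark bits stay indexed by the original address. The pool lookup is needed for marking anyway, which is why the extra cost should be small.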

>>
>> The numbers are not great, but I guess the usual memory usage and number
>> of modified pages will be much lower. I'll see if I can integrate this
>> into the concurrent implementation.
>
> Wish you luck, I'm still not sure if it will help.
>