rt_finalize WTFs?
Martin Nowak
dawg at dawgfoto.de
Mon Dec 5 14:34:02 PST 2011
On Mon, 05 Dec 2011 22:07:09 +0100, dsimcha <dsimcha at yahoo.com> wrote:
> == Quote from Martin Nowak (dawg at dawgfoto.de)'s article
>> I appreciate the recursion during mark; I wanted to do this myself
>> some time ago but expected a little more gain.
>
> The reason the gain wasn't huge is that on the benchmark I have that
> involves a deep heap graph, sweeping time dominates marking time. The
> performance gain for the mark phase only (which is important b/c this is
> when the world needs to be stopped) is ~20-30%.
>
>> Some more ideas:
>> - Do a major refactoring of the GC code, making it less resistant
>>   to change. Adding sanity checks or unit tests would be great.
>>   This probably reveals some obfuscated performance issues.
>
> Not just obfuscated ones. I've wanted to fix an obvious perf bug for two
> years and haven't done it because the necessary refactoring would be
> unbelievably messy and I'm too afraid I'll break something. Basically,
> malloc() sets the bytes between the size you requested and the size of
> the block actually allocated to zero to prevent false pointers. This is
> reasonable. The problem is that it does so **while holding the GC's
> lock**. Fixing it for just the case when malloc() is called by the user
> is also easy. The problem is fixing it when malloc() gets called from
> realloc(), calloc(), etc.
>
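For the direct malloc() case, the fix could be as simple as doing the memset
after releasing the lock. A minimal sketch of that idea; gcLock, mallocNoSync
and gcMalloc are stand-in names here, not the actual druntime symbols:
-----
import core.stdc.stdlib : malloc;
import core.sync.mutex : Mutex;

__gshared Mutex gcLock;
shared static this() { gcLock = new Mutex; }

// Stand-in for the pool allocator: rounds the request up to a bin size
// and reports how large the returned block really is.
void* mallocNoSync(size_t size, out size_t alloced)
{
    alloced = (size + 15) & ~cast(size_t)15;
    return malloc(alloced);
}

void* gcMalloc(size_t size)
{
    void* p;
    size_t alloced;

    gcLock.lock();
    p = mallocNoSync(size, alloced);
    gcLock.unlock();

    // Zero the slack between the requested and the allocated size only
    // after dropping the lock, so the memset no longer serializes all
    // allocating threads.
    if (p !is null && alloced > size)
        (cast(ubyte*)p)[size .. alloced] = 0;
    return p;
}
-----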
>> - Add more realistic GC benchmarks; this just requires adding to
>>   druntime/test/gcbench using the new runbench. The tree1 benchmark
>>   mainly uses homogeneous classes, so it is very synthetic.
>
> I'll crowdsource this. I can't think of any good benchmarks that are < a
> few hundred lines w/ no dependencies but aren't pretty synthetic.
>
One could probably write a tool that replays recorded allocation profiles,
but it will be difficult to fake internal pointers if they are not recorded.
If we were able to enable recording in the runtime, people having
performance issues with the GC could then send a trace file that can be
used for optimization.
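Something along these lines; the record layout and field names are purely
hypothetical, nothing like this exists in druntime today:
-----
import core.memory : GC;
import std.stdio : File;

// One hypothetical record per allocation event in the trace file.
struct AllocRecord
{
    uint   id;     // handle referenced by later free records
    uint   kind;   // 0 = malloc, 1 = free
    uint   attrs;  // GC.BlkAttr flags recorded at allocation time
    size_t size;   // requested size (unused for free)
}

void replay(string traceFile)
{
    void*[uint] live;   // id -> pointer of currently live blocks
    auto f = File(traceFile, "rb");
    AllocRecord[1] rec;
    while (f.rawRead(rec[]).length == 1)
    {
        switch (rec[0].kind)
        {
            case 0:
                live[rec[0].id] = GC.malloc(rec[0].size, rec[0].attrs);
                break;
            case 1:
                GC.free(live[rec[0].id]);
                live.remove(rec[0].id);
                break;
            default:
                break;
        }
    }
}
-----
The hard part mentioned above remains: the replayed blocks would not contain
the internal pointers of the original run, so the mark phase would see a
different heap graph.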
>> - There is one binary search pool lookup for every scanned address in
>>   range. There should be a lot to gain here, but it's difficult. It
>>   needs a multilevel mixture of bitset/hashtab.
>
> I understand the problem, but please elaborate on the proposed solution.
> You've basically got a bunch of pools, each of which represents a range
> of memory addresses, not a single address (so a basic hashtable is out).
> You need to know which range some pointer fits in. How would you beat
> binary search/O(log N) for this?
>
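For reference, the per-address lookup we are talking about is essentially
this binary search over the sorted pool table (a rough approximation, not
the exact druntime code):
-----
struct Pool { void* baseAddr, topAddr; }   // stand-in for the GC's pool

Pool* findPool(Pool*[] pooltable, void* p)
{
    size_t lo = 0, hi = pooltable.length;
    while (lo < hi)
    {
        immutable mid = (lo + hi) / 2;
        auto pool = pooltable[mid];
        if (p < pool.baseAddr)
            hi = mid;            // address lies below this pool
        else if (p >= pool.topAddr)
            lo = mid + 1;        // address lies above this pool
        else
            return pool;         // baseAddr <= p < topAddr
    }
    return null;                 // not a GC address
}
-----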
It is possible to use a hashtable or a bitset because each pool occupies at
least PAGESIZE. One would need to use addresses relative to the minimum pool
address, but that still needs 256K entries per GByte (1 GiB / 4 KiB pages =
262,144 slots), so the maintenance overhead is likely too big.
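As a back-of-the-envelope sketch of that dismissed variant, a direct page
table would look roughly like this (PAGESIZE of 4 KiB assumed, Pool standing
in for the GC's pool type):
-----
enum PAGESIZE = 4096;

struct Pool { /* the GC's pool */ }

struct PageTable
{
    void*   minAddr;   // lowest address covered by any pool
    Pool*[] pages;     // one slot per page: 1 GiB needs 262_144 slots

    Pool* getPool(void* p)
    {
        // addresses below minAddr wrap around to a huge index -> null
        immutable idx = (cast(size_t)p - cast(size_t)minAddr) / PAGESIZE;
        return idx < pages.length ? pages[idx] : null;
    }
}
-----
Every pool that is added or removed then has to update one slot per page it
covers, which is where the maintenance overhead comes from.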
More promising is to put the pool address ranges in a trie:

addr[7]              [... . ...]
                    /     |     \
addr[6]     [... . ...]        [... . ...]
            /    |    \        /    |    \
addr[5]   pool:8      [... . ...]
                      /    |    \
addr[4]          pool:8  [....]  pool:5
Lookup is bounded constant time, with at most 8 memory accesses (4 with
32-bit addresses). Maintenance is only required when adding/removing pools,
though it is a little involved.
-----
alias Node Trie;

struct Node
{
    union
    {
        size_t    _type;      // tag bits share storage with the pointers
        Pool*     _pool;
        Node[16]* _nodes16;   // indexed by the upper 4 bits of a level's byte
        // ... further node granularities elided
        Node[256]* _nodes256; // indexed by the full 8 bits of a level's byte
    }

    // Use the lower 3 bits to encode the node type, given the guaranteed
    // 8-byte alignment of malloc.
    enum NTypeBits = 3;
    enum TypeMask = (1 << NTypeBits) - 1;
    enum PtrMask = ~cast(size_t)TypeMask;
    enum Type { Empty, Pool, Nodes16, Nodes256 }
    static assert(Type.max <= TypeMask);

    Pool* getPool(void* p)
    {
        // start the descent at the most significant byte of the address
        return getPoolImpl((cast(ubyte*)&p) + size_t.sizeof - 1);
    }

    Pool* getPoolImpl(ubyte* p)
    {
        Node* node = void;
        final switch (type)
        {
        case Type.Empty:
            return null;
        case Type.Pool:
            return pool;
        case Type.Nodes16:
            node = &nodes16[*p >> 4];
            break;
        case Type.Nodes256:
            node = &nodes256[*p];
            break;
        }
        if (node.type == Type.Pool)
            return node.pool;
        else
            return node.getPoolImpl(p - 1);
    }

    void insertPool(Pool* npool, uint level = 0)
    {
        final switch (type)
        {
        case Type.Empty:
            pool = npool;
            break;
        case Type.Pool:
            Pool* opool = pool;
            if (opool != npool)
            {
                nodes16 = new Node[16];
                foreach (i; 0 .. 16)
                {
                    // non-trivial code here:
                    // needs to figure out into which subnodes opool and npool
                    // should be inserted, using pool.minAddr and pool.maxAddr
                    nodes16[i].insertPool(opool, level + 1);
                    nodes16[i].insertPool(npool, level + 1);
                }
            }
            break;
        case Type.Nodes16:
            // count empty nodes and decide whether
            // to switch to more child nodes
            break;
        case Type.Nodes256:
            // descend into the child selected by the current address byte
            break;
        }
    }

    void removePool(Pool* npool, uint level = 0)
    {
        // reverse of above
    }

    @property Type type()
    {
        return cast(Type)(_type & TypeMask);
    }

    @property void type(Type t)
    {
        assert(t <= TypeMask);
        _type |= t;   // assumes the tag bits are currently clear
    }

    @property Pool* pool()
    {
        return cast(Pool*)(cast(size_t)_pool & PtrMask);
    }

    @property void pool(Pool* p)
    {
        _pool = p;
        type = Type.Pool;
    }

    // ... analogous tagged-pointer accessors for _nodes16 and _nodes256
}
-----
>> - Reduce the GC roots range. I will have to work on this for
>> shared library support anyhow.
>
> Please clarify what you mean by "reduce" the roots range.
>
We currently add the whole range from etext to _end (rt.memory) as a static
root. This probably contains read-only sections, data from other languages,
and (unlikely) other unrelated sections. I think some help from the compiler
will be necessary to support shared libraries anyhow, so we should be able
to get a precise range.
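Roughly what I have in mind, sketched for an ELF target; the section symbols
below (__data_start, _edata, __bss_start, _end) are linker-provided and
platform-specific, so take the names as assumptions:
-----
import core.memory : GC;

extern (C)
{
    extern __gshared int __data_start; // start of writable initialized data
    extern __gshared int _edata;       // end of initialized data
    extern __gshared int __bss_start;  // start of zero-initialized data
    extern __gshared int _end;         // end of bss
}

void registerPreciseRoots()
{
    // Register only the writable data and bss segments instead of the whole
    // etext.._end range, so read-only sections are no longer scanned.
    GC.addRange(&__data_start,
                cast(size_t)&_edata - cast(size_t)&__data_start);
    GC.addRange(&__bss_start,
                cast(size_t)&_end - cast(size_t)&__bss_start);
}
-----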
> Thanks for the feedback/suggestions.