AA rehash threshold

Thu Nov 20 14:45:07 PST 2014

On 11/20/14 5:30 PM, Jerry Quinn wrote:
> Steven Schveighoffer <schveiguy at yahoo.com> writes:
>
>> On 11/18/14 9:46 PM, deadalnix wrote:
>>>
>>> After all, unless your hash table is very small, the first hit is
>>> most likely a cache miss, meaning ~300 cycles. at this point the
>>> cache line is in L1, with a 3 cycle access time. So accessing
>>> several element in the same cache line is often preferable.
>>
>> In the case of AAs, all bucket elements are pointers, so this point is moot
>> for our purposes -- one may have to jump outside the cache to find the first
>> element. A good improvement (this is likely to come with a full library type)
>> is to instead store an inline element for each bucket entry instead of just a
>> pointer to an element. I recall when writing dcollections, this added a
>> significant speedup.
>
> This works nicely for small types, but has gotchas.  For example, if
> you've got an AA of ints, what value indicates that this is a value
> folded into the bucket entry vs actually being a pointer?  You'll need
> extra space to make sure it's safe.  Alignment is another concern.
> Let's say you have bool for the entry/element flag.  With ints you've
> doubled the size of the bucket array.

Hm.. the bucket entry will simply be what a node is now. You are 
actually saving space, because you don't need to store that extra 
pointer to the first node.

Where it gets dicey is when you rehash. Some of the nodes will be moved 
into the bucket space, some will move out of it. But I don't think it's 
a big deal, as a rehash doesn't happen very often.

> If you have large objects as the elements, you'll waste space.  Probably
> the implementation needs to be switchable depending on the size of the object.

Yes, this is also a concern. This almost necessitates a minimum load as 
well as a maximum.

> As an aside, I find it invaluable to have statistics about hash table
> buckets, ala C++11.  When building custom hash functions and hash table
> implementations, it's almost necessary.
>
> I'd like to see bucket stats available in some manner for built-in AAs.
> If they're going away for a library type, we should have such stat info
> as part of the API.

Absolutely all of this could be added to a library type.

-Steve