AA rehash threshold

Tue Nov 18 18:54:45 PST 2014

On 11/18/14 9:46 PM, deadalnix wrote:
> On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven
> Schveighoffer wrote:
>> From all I remember about hash tables, you are supposed to minimize
>> the number of collisions when adding/looking up data. This helps keep
>> the O(1) lookup property intact.
>>
>
> That is the theory, but the real world is sometime different. It
> depends on how you resolve collisions. If you do it via some kind
> of linked list, then yes, you must remove all collisions.

This is what AAs currently do.

> But if you do it by putting the element in a slot close to its
> actual position, then thing gets different. In that case, you
> don't want to minimize collision, but to minimize the distance to
> the object, in order for it to be in the cache line.

Heh, a load factor of 4 wouldn't work too well there :)

I do recall reading about all the different cache miss handling. The 
wikipedia article I referenced has a lot of good info.

>
> After all, unless your hash table is very small, the first hit is
> most likely a cache miss, meaning ~300 cycles. at this point the
> cache line is in L1, with a 3 cycle access time. So accessing
> several element in the same cache line is often preferable.

In the case of AAs, all bucket elements are pointers, so this point is 
moot for our purposes -- one may have to jump outside the cache to find 
the first element. A good improvement (this is likely to come with a 
full library type) is to instead store an inline element for each bucket 
entry instead of just a pointer to an element. I recall when writing 
dcollections, this added a significant speedup.

-Steve