AA rehash threshold

Jerry Quinn via Digitalmars-d digitalmars-d at puremagic.com
Thu Nov 20 14:30:54 PST 2014


Steven Schveighoffer <schveiguy at yahoo.com> writes:

> On 11/18/14 9:46 PM, deadalnix wrote:
>>
>> After all, unless your hash table is very small, the first hit is
>> most likely a cache miss, meaning ~300 cycles. at this point the
>> cache line is in L1, with a 3 cycle access time. So accessing
>> several element in the same cache line is often preferable.
>
> In the case of AAs, all bucket elements are pointers, so this point is moot
> for our purposes -- one may have to jump outside the cache to find the first
> element. A good improvement (this is likely to come with a full library type)
> is to instead store an inline element for each bucket entry instead of just a
> pointer to an element. I recall when writing dcollections, this added a
> significant speedup.

This works nicely for small types, but has gotchas.  For example, if
you've got an AA of ints, what value indicates that this is a value
folded into the bucket entry vs actually being a pointer?  You'll need
extra space to make sure it's safe.  Alignment is another concern.
Let's say you have bool for the entry/element flag.  With ints you've
doubled the size of the bucket array.

If you have large objects as the elements, you'll waste space.  Probably
the implementation needs to be switchable depending on the size of the object.

As an aside, I find it invaluable to have statistics about hash table
buckets, ala C++11.  When building custom hash functions and hash table
implementations, it's almost necessary.

I'd like to see bucket stats available in some manner for built-in AAs.
If they're going away for a library type, we should have such stat info
as part of the API.

Jerry


More information about the Digitalmars-d mailing list