Memory usage of AAs?

spir denis.spir at gmail.com
Wed Mar 30 07:44:54 PDT 2011


On 03/30/2011 03:31 PM, Steven Schveighoffer wrote:
> On Tue, 29 Mar 2011 22:20:05 -0400, Nick Sabalausky <a at a.a> wrote:
>
>> "spir" <denis.spir at gmail.com> wrote in message
>> news:mailman.2909.1301443345.4748.digitalmars-d-learn at puremagic.com...
>>> On 03/30/2011 01:24 AM, Nick Sabalausky wrote:
>>>> My understanding of hash tables is that they allocate a fixed size array
>>>> and
>>>> map keys to indicies within the range 0..predefined_length_of_the_AA.
>>>>
>>>> So I've been wondering, how many elements do D's built-in AAs have? And
>>>> what's the content of each one, just a single pointer?
>>>
>>> Each element is a data structure, often called bucket (typically a link
>>> list), storing (key:value) pairs for which the key, once hashed and
>>> modulo-ed, maps to the given index. That's why the famous O(1) lookup time
>>> for hash tables is very theoretic: the constant part holds average time
>>> for hashing which is very variable, plus another average time for linearly
>>> traversing the bucket. The latter part depends on table load factor (=
>>> number of elements / number of buckets) and proper statistical repartition
>>> of keys into buckets.
>>>
>>> The key (!) points are finding a good hash func to "linearize" said
>>> repartition (which depends on actual key-data domain, not only on their
>>> type...), but better ones rapidly eat much time (ones used in practice are
>>> rather simple & fast); and finding optimal load factor, and growth scheme.
>>> In practice, all of this tends to make hash tables an implementation
>>> nightmare (for me). I'd love to find practicle alternatives, but efficient
>>> binary trees also are complex and even more depend on kind of keys, I
>>> guess.
>>>
>>
>> Right, I know that, but that's not what I was asking. Take this hypothetical
>> implementation:
>>
>> struct Bucket(TKey, TVal)
>> {
>> ...
>> }
>>
>> enum hashTableSize = ...;
>>
>> struct Hash(TKey, TVal)
>> {
>> Bucket!(TKey, TVal)[hashTableSize] data;
>>
>> TVal get(TKey key) { ... }
>> void set(TKey key, TVal value) { ... }
>> }
>>
>> I assume that D's AA are something at least vaguely like that. My questions
>> are:
>>
>> 1. What does D use for "hashTableSize"? Or does "hashTableSize" vary? If it
>> varies, what's a typical rough ballpark size? (And just out of curiosity, if
>> it varies, is it decided at compile-time, or does it change even at
>> runtime?)
>
> It varies. The hash table size is not constant, the load factor is. The load
> factor is the number of elements in the hash divided by the number of buckets.
> You never want to fill up all the spaces, because the more full you get, the
> more chances for collisions there are. Essentially, the tricky part about
> hashing is what to do about collisions (two elements are different, but go in
> the same bucket).
>
> So what happens is when the load factor exceeds a predefined constant (e.g. in
> dcollections the load factor defaults to .75), the table "rehashes", or
> increases (usually logarithmically) the size of the array, and re-inserts all
> its elements.
>
> I believe there is a minimum array size, and things are increased from there. I
> think also you can do a manual "rehash" which should adjust the size of the
> array to match a certain load factor (below the maximum).

Yes, there is a rehash method.

> In some implementations, hashes will even shrink when the load factor goes
> below a minimum (dcollections does not do this to avoid invalidating ranges).
> There are a million different ways to implement the basic hash. The most
> complex part though, is usually the collision handling. In my algo book, there
> are at least 3 ways to handle collisions, and I think there are countless more.
> If you look up hashing on wikipedia, you'll get a much better explanation.
>
>>
>> 2. What is "sizeof(Bucket(TKey, TVal))"? And I mean the shallow size, not
>> deep size. Is it dependent on TKey or TVal? Or is it just simply a pointer
>> to the start of a linked list (and therefore "sizeof(size_t)")?
>
> Here is the AA implementation:
>
> https://github.com/D-Programming-Language/druntime/blob/master/src/rt/aaA.d
>
>  From that page, you can see that AA is your bucket (note this is runtime
> stuff, so there are no templates), and BB is your Hash struct. It looks like BB
> has an array of AA pointers.

IIRC, this is because buckets are (minimal) link lists. Cells holding key:value 
are list nodes.

> -Steve

-- 
_________________
vita es estrany
spir.wikidot.com



More information about the Digitalmars-d-learn mailing list