Associative Array: reasonable limits?

Dmitry Olshansky dmitry.olsh at gmail.com
Tue Nov 5 05:34:06 PST 2013


05-Nov-2013 02:20, Charles Hixson wrote:
> On 11/03/2013 01:46 AM, Dmitry Olshansky wrote:
>> 03-Nov-2013 02:37, Charles Hixson wrote:
>>> I'm contemplating an associative array that will eventually grow to be
>>> an estimated 64KB in size, assuming it's about half full.  It would then
>>> be holding around 90,000 entries.  Is this reasonable, or should I go
>>> with a sorted array, and do binary searches?  I estimate that binary
>>> searches would average around 15 accesses/hit, but that's still a lot
>>> faster than disk access.  The array would be nearly full, so there would
>>> be less wasted space, but, of course, insertions would become quite
>>> expensive.  (Deletions aren't expected.)  In either case the keys would
>>> be fixed length character arrays, and the values would also be simple
>>> structs without reference types (so the garbage collector could pretty
>>> much ignore things).
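
(Side note: log2(90,000) is about 16.5, so ~15-17 probes per binary search 
is indeed the right ballpark.)
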
>>
>> 90k elements is a small AA. It would work just fine with any sane hash
>> function (e.g. the built-in one).
>>
>> At around 10M+ elements it may be worth considering space-saving
>> techniques. That said, on x64 100M+ is perfectly OK; on 32-bit I'd
>> advise against it, but only because of the GC.
>>
>>> FWIW I'm expecting this array to be around the entire time the program
>>> is running, but it would probably be rolled out most of the time, as
>>> frequently accessed items would be pulled into a much smaller structure,
>>> and if I go with the array rather than the hash table I may only do
>>> updates as a batch.  (Since it's linear with the size of the array, but
>>> essentially ignores the number of items being added.)
>>>
>>> My temptation is to go with the hash table...but I'm not sure of the
>>> underlying mechanics.
>>
>> Think of it as a sparse array that has "holes" and takes around 2-4
>> times the size of a normal array. That's not only because of the holes
>> but because each slot is a bit bigger.
>> In return, lookups/inserts/deletions are cheap O(1) with high probability.
>>
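
For illustration, a minimal sketch of the plain built-in AA route with a 
fixed-length char-array key and a reference-free struct value (the 8-char 
key, the Entry fields and all names below are made up for the example):

import std.stdio;

// Hypothetical record type: plain data, no references, so the GC can mostly ignore it.
struct Entry
{
    int    count;
    double weight;
}

void main()
{
    Entry[char[8]] table;        // built-in AA keyed by a fixed-length char array

    char[8] key = "abcdefgh";    // fixed-length key, 8 chars just for the demo
    table[key] = Entry(1, 0.5);  // insert: expected O(1)

    if (auto p = key in table)   // lookup: expected O(1); yields a pointer or null
        writeln("found: ", p.count, " ", p.weight);
}
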
> Thanks, I'd been assuming that it would just be around 128 bits larger
> per entry (i.e., two pointers) and around a 50% fill rate.  Sounds like a bit
> worse than I was estimating...not hugely worse though.  OTOH, that's an
> argument in favor of using a small cache that is a hash table, and a
> large array as a backing store, where infrequently accessed items could
> be kept.  Probably if the cache could hold around 2,000 entries it would
> have over a 90% hit rate (though that would need to be tested).

What are your requirements for this data structure, anyway?

> And the
> array would also provide a sorted access method for traversing it.  I'm
> not sure that's important, but it could be.  It could also be saved to
> an externally editable file (i.e., using a standard text editor) which
> also might be important.

Oh my, that's a lot of complexity for no gain. Truth be told, I tend to 
always optimize for speed (= less power consumption) these days. What you 
describe is sort of like collision pages (as in some DBs), but deployed in 
such a small scenario I see ZERO gain in any dimension.
You know, code also occupies some RAM ;).
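
FWIW, if the sorted, externally editable dump is the main argument for the 
big sorted array, the plain AA can already produce one on demand. A rough 
sketch (Entry, dumpSorted, the key length and the line format are all 
invented for the example):

import std.algorithm : sort;
import std.stdio : File;

// Same hypothetical record type as in the sketch above.
struct Entry
{
    int    count;
    double weight;
}

// Dump the AA as a sorted, human-editable text file: one "key count weight" line per entry.
void dumpSorted(Entry[char[8]] table, string path)
{
    auto keys = table.keys;               // snapshot of all keys
    sort!((a, b) => a[] < b[])(keys);     // sort only when dumping
    auto f = File(path, "w");
    foreach (k; keys)
    {
        auto e = table[k];
        f.writefln("%s %s %s", k, e.count, e.weight);
    }
}

Sorting ~90k keys once per dump is cheap compared to keeping a second 
sorted structure around all the time.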

> This will, naturally, slow down insertions. (I don't expect any
> deletions, as they would render historical data unintelligible.)
>
> Perhaps I'll batch the insertions, as the insertion cost is linear WRT the
> size of the array but essentially independent of the number of
> insertions.  (I.e., no matter how many insertions, it's a one-pass job.)
>

Even more code. And space for batching, plus bookkeeping to track whether 
an item is batched but not yet in the collection.
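
Just to make that concrete, here is roughly the bookkeeping the batched 
sorted-array design drags in. Everything below (BatchedStore, Entry, the 
8-char keys) is invented for illustration, and flush() simply appends and 
re-sorts; a true one-pass merge of two sorted runs would be more code still:

import std.algorithm : sort;
import std.range : assumeSorted, zip;

struct Entry { int count; double weight; }

struct BatchedStore
{
    char[8][]      keys;      // sorted backing arrays, kept in tandem
    Entry[]        values;
    Entry[char[8]] pending;   // inserted but not yet merged -- extra state to track

    void insert(char[8] k, Entry v)
    {
        pending[k] = v;       // deferred until the next flush()
    }

    // Every lookup now has to check the pending batch before binary-searching.
    Entry* lookup(char[8] k)
    {
        if (auto p = k in pending)
            return p;
        auto sorted = assumeSorted!((a, b) => a[] < b[])(keys);
        auto i = sorted.lowerBound(k).length;
        if (i < keys.length && keys[i] == k)
            return &values[i];
        return null;
    }

    // Fold the batch into the backing arrays.
    void flush()
    {
        if (pending.length == 0)
            return;
        foreach (k, v; pending)
        {
            keys   ~= k;
            values ~= v;
        }
        sort!((a, b) => a[0][] < b[0][])(zip(keys, values));  // re-sort both in parallel
        pending = null;
    }
}

And that is before persistence, error handling or any cache on top.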

-- 
Dmitry Olshansky

