Associative Array: reasonable limits?
Dmitry Olshansky
dmitry.olsh at gmail.com
Tue Nov 5 05:34:06 PST 2013
05-Nov-2013 02:20, Charles Hixson wrote:
> On 11/03/2013 01:46 AM, Dmitry Olshansky wrote:
>> 03-Nov-2013 02:37, Charles Hixson wrote:
>>> I'm contemplating an associative array that will eventually grow to be
>>> an estimated 64KB in size, assuming it's about half full. It would then
>>> be holding around 90,000 entries. Is this reasonable, or should I go
>>> with a sorted array, and do binary searches? I estimate that binary
>>> searches would average around 15 accesses/hit, but that's still a lot
>>> faster than disk access. The array would be nearly full, so there would
>>> be less wasted space, but, of course, insertions would become quite
>>> expensive. (Deletions aren't expected.) In either case the keys would
>>> be fixed-length character arrays, and the values would also be simple
>>> structs without reference types (so the garbage collector could pretty
>>> much ignore things).
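For concreteness, the structure being described might look roughly like
this in D; the key length, field names, and update logic are illustrative
assumptions, not taken from the thread:

    struct Info
    {
        uint  count;   // plain value fields only, so the GC can
        float weight;  // largely ignore the payload
    }

    Info[char[8]] table;   // fixed-length character-array keys hash by value

    void record(char[8] key)
    {
        if (auto p = key in table)       // pointer to the stored value, or null
            p.count++;
        else
            table[key] = Info(1, 0.0f);  // first sighting
    }

    void main()
    {
        char[8] k = "buzzword";          // exactly 8 characters
        record(k);
        record(k);
        assert(table[k].count == 2);
    }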
>>
>> 90k elements is a small AA. It would work just fine with any sane hash
>> function (e.g. the built-in one).
>>
>> At around 10M+ elements it may be worth considering space-saving
>> techniques. That said, on x64 100M+ is perfectly OK; on 32-bit I'd
>> advise against it, but only because of the GC.
>>
>>> FWIW I'm expecting this array to be around the entire time the program
>>> is running, but it would probably be rolled out most of the time, as
>>> frequently accessed items would be pulled into a much smaller structure.
>>> If I go with the array rather than the hash table I may only do updates
>>> as a batch, since a batch update is linear in the size of the array but
>>> essentially independent of the number of items being added.
>>>
>>> My temptation is to go with the hash table...but I'm not sure of the
>>> underlying mechanics.
>>
>> Think of it as a sparse array that has "holes" and takes around 2-4
>> times the size of a normal array. That's not only because of the holes
>> but because each slot is a bit bigger.
>> In return, lookups/inserts/deletions are cheap: O(1) with high probability.
>>
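A rough back-of-envelope for the 90k case; the slot layout below is an
assumption made purely for illustration, not druntime's actual bucket
format:

    // Compares a dense sorted array against a ~half-full hash table whose
    // slots are assumed to carry the key, the value and a cached size_t hash.
    void main()
    {
        import std.stdio : writefln;

        enum keySize    = 8;        // e.g. char[8] key
        enum valueSize  = 8;        // small value struct
        enum entries    = 90_000;
        enum loadFactor = 0.5;      // roughly half-full table

        enum slotSize = keySize + valueSize + size_t.sizeof;
        auto slots    = cast(size_t)(entries / loadFactor);

        writefln("dense array: %s KB", entries * (keySize + valueSize) / 1024);
        writefln("hash table : %s KB (~%.1f times larger)",
                 slots * slotSize / 1024,
                 cast(double)(slots * slotSize) / (entries * (keySize + valueSize)));
    }

With those particular numbers it comes out around 3x, i.e. inside the 2-4x
ballpark above.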
> Thanks, I'd been assuming that it would just be around 128 bits larger
> (i.e., two pointers) and around a 50% fill rate. Sounds a bit
> worse than I was estimating...not hugely worse though. OTOH, that's an
> argument in favor of using a small hash-table cache in front of a large
> array as a backing store, where infrequently accessed items could
> be kept. Probably if the cache could hold around 2,000 entries it would
> have over a 90% hit rate (though that would need to be tested).
What are your requirements for this data structure, anyway?
> And the
> array would also provide a sorted access method for traversing it. I'm
> not sure that's important, but it could be. It could also be saved to
> an externally editable file (i.e., one editable with a standard text
> editor), which might also be important.
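A minimal sketch of that two-level idea, with a built-in AA as the hot
cache in front of a sorted backing array; the names, key length, and
promotion policy (no eviction shown) are illustrative assumptions:

    import std.range : assumeSorted;

    struct Entry { char[8] key; double value; }

    double[char[8]] cache;   // small, hot entries
    Entry[] store;           // large backing array, kept sorted by key

    bool lookup(char[8] key, out double value)
    {
        if (auto p = key in cache)   // O(1) hot path
        {
            value = *p;
            return true;
        }
        // Cold path: binary search, about log2(90_000), i.e. ~17 probes.
        auto hit = store.assumeSorted!((a, b) => a.key < b.key)
                        .equalRange(Entry(key, 0));
        if (hit.empty)
            return false;
        value = hit.front.value;
        cache[key] = value;          // promote into the cache
        return true;
    }

    void main()
    {
        char[8] k = "aardvark";
        store = [Entry(k, 3.14)];
        double v;
        assert(lookup(k, v) && v == 3.14);
        assert(k in cache);          // now served from the hot path
    }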
Oh my, that's a lot of complexity for no gain. Truth be told, I tend to
always optimize for speed (= less power consumption) these days. What you
describe is a sort of collision-page scheme (like in some DBs), but
deployed in such a small scenario I see ZERO gain in any dimension.
You know, code also occupies some RAM ;).
> This will, naturally, slow down insertions. (I don't expect any
> deletions, as they would render historical data unintelligible.)
>
> Perhaps I'll batch the insertions, as insertion is linear WRT the size
> of the array but essentially independent of the number of insertions.
> (I.e., no matter how many insertions, it's a one-pass job.)
>
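A sketch of the one-pass batch update just described: sort the small
pending batch, then merge it with the existing sorted array in a single
linear pass, so the cost is O(store + batch) no matter how many entries
are pending. Names are illustrative; the lazy merge comes from
std.algorithm.sorting:

    import std.algorithm.sorting : merge, sort;
    import std.array : array;

    struct Entry { char[8] key; double value; }

    // One pass over the existing store, independent of the batch size.
    Entry[] applyBatch(Entry[] store, Entry[] batch)
    {
        batch.sort!((a, b) => a.key < b.key);   // sort only the small batch
        return merge!((a, b) => a.key < b.key)(store, batch).array;
    }

    void main()
    {
        char[8] ka = "aaaaaaaa", kb = "bbbbbbbb", kc = "cccccccc";
        auto store = [Entry(ka, 1), Entry(kc, 3)];
        store = applyBatch(store, [Entry(kb, 2)]);
        assert(store.length == 3 && store[1].key == kb);
    }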
Even more code. And space for the batching, plus bookkeeping for whether
an item is batched but not yet in the collection.
--
Dmitry Olshansky