Associative Array: reasonable limits?

Charles Hixson charleshixsn at earthlink.net
Mon Nov 4 14:20:58 PST 2013


On 11/03/2013 01:46 AM, Dmitry Olshansky wrote:
> 03-Nov-2013 02:37, Charles Hixson wrote:
>> I'm contemplating an associative array that will eventually grow to be
>> an estimated 64KB in size, assuming it's about half full.  It would then
>> be holding around 90,000 entries.  Is this reasonable, or should I go
>> with a sorted array and do binary searches?  I estimate that binary
>> searches would average around 15 accesses per hit (log2(90,000) is
>> about 16.5), but that's still a lot
>> faster than disk access.  The array would be nearly full, so there would
>> be less wasted space, but, of course, insertions would become quite
>> expensive.  (Deletions aren't expected.)  In either case the keys would
>> be fixed length character arrays, and the values would also be simple
>> structs without reference types (so the garbage collector could pretty
>> much ignore things).
>
> 90k elements is a small AA. It would work just fine with any sane hash 
> function (e.g. the built-in one).
>
> At around 10M+ elements it may be worth considering space-saving 
> techniques. That said, on x64 100M+ is perfectly OK; on 32-bit I'd 
> advise against it, but only because of the GC.
>
>> FWIW I'm expecting this array to be around the entire time the program
>> is running, but it would probably be rolled out most of the time, as
>> frequently accessed items would be pulled into a much smaller structure,
>> and if I go with the array rather than the hash table I may only do
>> updates as a batch.  (Since it's linear with the size of the array, but
>> essentially ignores the number of items being added.)
>>
>> My temptation is to go with the hash table...but I'm not sure of the
>> underlying mechanics.
>
> Think of it as a sparse array that has "holes" and takes around 2-4 
> times the size of a normal array. That's not only because of the holes 
> but because each slot is a bit bigger.
> In return, lookups/inserts/deletions are cheap: O(1) with high 
> probability.
>
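
For concreteness, the built-in AA Dmitry recommends handles keys and
values of this shape directly.  A minimal sketch, where the key length
and the field names are placeholders of mine, not part of the real
design:

import std.stdio;

// Placeholder value type: a plain struct with no reference fields,
// so the GC has nothing to scan inside the stored values.
struct Info
{
    uint count;
    float weight;
}

void main()
{
    // Built-in AA keyed by a fixed-length character array
    // (length 8 is an arbitrary placeholder).
    Info[char[8]] table;

    char[8] key = "abcdefgh";
    table[key] = Info(1, 0.5f);

    // `in` yields a pointer to the value, or null if absent.
    if (auto p = key in table)
        writeln(p.count, " ", p.weight);
}
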
Thanks, I'd been assuming that it would just be around 128 bits larger 
per entry (i.e., two pointers) with around a 50% fill rate.  Sounds a 
bit worse than I was estimating...not hugely worse, though.  OTOH, 
that's an argument in favor of using a small hash table as a cache, 
with a large sorted array as a backing store where infrequently 
accessed items could be kept.  If the cache could hold around 2,000 
entries, it would probably have over a 90% hit rate (though that would 
need to be tested).  The array would also provide a sorted access 
method for traversing the data.  I'm not sure that's important, but it 
could be.  It could also be saved to an externally editable file 
(i.e., one editable with a standard text editor), which also might be 
important.
This will, naturally, slow down insertions. (I don't expect any 
deletions, as they would render historical data unintelligible.)
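
A minimal sketch of the two-level lookup I have in mind, assuming a
placeholder record type and names (nothing here is a committed design):

import std.range : assumeSorted;

struct Rec { char[8] key; uint count; }   // placeholder record

struct Store
{
    uint[char[8]] cache;  // small hash-table cache: key -> value
    Rec[] backing;        // large array, kept sorted by key

    // Check the cache first; fall back to a binary search
    // (~log2(n) probes) on the sorted backing array.
    bool lookup(char[8] key, out uint value)
    {
        if (auto p = key in cache) { value = *p; return true; }
        auto hits = backing.assumeSorted!((a, b) => a.key < b.key)
                           .equalRange(Rec(key, 0));
        if (hits.empty) return false;
        value = hits.front.count;
        cache[key] = value;  // promote the entry into the cache
        return true;
    }
}

The sketch leaves out eviction; something would have to trim the cache
once it grows past a couple thousand entries.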

Perhaps I'll batch the insertions, since an insertion pass is linear in 
the size of the array but essentially independent of the number of 
insertions.  (I.e., no matter how many insertions, it's a one-pass job.)
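
That one-pass merge might look something like this (a sketch, assuming
the batch gets sorted first; Rec is the same placeholder struct as
above):

import std.algorithm.sorting : sort;

struct Rec { char[8] key; uint count; }  // placeholder record

// Merge a whole batch of new entries into the sorted backing array in
// a single pass over the array, regardless of the batch size.
Rec[] batchInsert(Rec[] backing, Rec[] batch)
{
    batch.sort!((a, b) => a.key < b.key);
    Rec[] merged;
    merged.reserve(backing.length + batch.length);
    size_t i, j;
    while (i < backing.length && j < batch.length)
        merged ~= (backing[i].key <= batch[j].key)
            ? backing[i++] : batch[j++];
    merged ~= backing[i .. $];  // append whichever side has leftovers
    merged ~= batch[j .. $];
    return merged;
}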

-- 
Charles Hixson


