Associative Array: reasonable limits?

Wed Nov 6 10:10:18 PST 2013

06-Nov-2013 00:36, Charles Hixson пишет:
> On 11/05/2013 05:34 AM, Dmitry Olshansky wrote:
>> 05-Nov-2013 02:20, Charles Hixson пишет:
>>> On 11/03/2013 01:46 AM, Dmitry Olshansky wrote:
>>>> 03-Nov-2013 02:37, Charles Hixson пишет:
>>>>> I'm contemplating an associative array that will eventually grow to be
>>>>> an estimated 64KB in size, assuming it's about half full. It would
>>>>> then
>>>>> be holding around 90,000 entries.  Is this reasonable, or should I go
>>>>> with a sorted array, and do binary searches?  I estimate that binary
>>>>> searches would average around 15 accesses/hit, but that's still a lot
>>>>> faster than disk access.  The array would be nearly full, so there
>>>>> would
>>>>> be less wasted space, but, of course, insertions would become quite
>>>>> expensive.  (Deletions aren't expected.)  In either case the keys
>>>>> would
>>>>> be fixed length character arrays, and the values would also be simple
>>>>> structs without reference types (so the garbage collector could pretty
>>>>> much ignore things).
>>>>
>>>> 90k elements is a small AA. It would work just fine with any sane hash
>>>> function (e.g. the built-in one).
>>>>
>>>> At around 10M+ elements it may be worth to consider space saving
>>>> techniques. That said on x64 100M+ is perfectly OK, on 32bit I'd
>>>> advise against it but only because of GC.
>>>>
>>>>> FWIW I'm expecting this array to be around the entire time the program
>>>>> is running, but it would probably be rolled out most of the time, as
>>>>> frequently accessed items would be pulled into a much smaller
>>>>> structure,
>>>>> and if I go with the array rather than the hash table I may only do
>>>>> updates as a batch.  (Since it's linear with the size of the array,
>>>>> but
>>>>> essentially ignores the number of items being added.)
>>>>>
>>>>> My temptation is to go with the hash table...but I'm not sure of the
>>>>> underlying mechanics.
>>>>
>>>> Think of it as a sparse array that has "holes" and takes around ~2-4
>>>> times the size of normal array. It's not only because of holes but
>>>> because a slot is a bit bigger.
>>>> In return lookups/inserts/deletions are cheap O(1) with high
>>>> probability.
>>>>
>>> Thanks, I'd been assuming that it would just be around 128 bytes larger
>>> (i.e., two pointers) and around a 50% fill rate.  Sounds like a bit
>>> worse than I was estimating...not hugely worse though.  OTOH, that's an
>>> argument in favor of using a small cache that is a hash table, and a
>>> large array as a backing store, where infrequently accessed items could
>>> be kept.  Probably if the cache could hold around 2,000 entries it would
>>> have over a 90% hit rate (though that would need to be tested).
>>
>> What's your requirements for this data-structure anyway?

Still no answer. I'm asking it only because designing any data-structure 
w/o answering this first is an exercise in futility.

It seems like priority is lookup speed and data size. Search for succint 
data-structures if you really want to dive that deep. I'd recommend it 
only for fun, not for the task at hand.

> I'm not sure about the batch adds.  I certainly won't do it at first,
> while the data is small.  The thing is, I want to keep as much as
> feasible rolled out most of the time.  That means that while it's active
> in "RAM" that RAM is actually pages on the disk of virtual memory.  With
> a large hash table, that's not likely to be controllable.  What I'm
> doing here is optimizing for "speed and..." where and includes active
> memory use (as opposed to rolled out virtual memory).  A part of what
> makes a hash table efficient is that it's always in RAM.

The moment you start using data it all gets in RAM pretty soon 
(potentially with a few of page faults). After that point you may as 
well forget about page system and virtual memory.

The thing is that number of page faults is environment dependent anyway 
an is "a constant cost" compared to any operations on data.

> If you need to
> roll it in to access it, then you lose most of the efficiency.
> A sorted
> array is a pretty straightforwards data structure.  It's reasonably
> quick to access, though not as fast as a hash table.
> The real killer is
> insertion (or deletion).  And it doesn't lose anywhere near as much by
> being rolled out (for one thing, each page holds about twice as much
> data, so you don't need as many reads).

I doubt you'll ever be able to measure this effect on top of OS induced 
noise. Thing is OS will keep pages forever in RAM anyway if there is 
plenty of it. And if it's paging on every call to your tiny dataset - 
you have far-far bigger problems anyway.

> And it's easy to save to
> external disk.  (You can size it easily so that when starting up the
> program again, the array is initialized at the correct size.)

*Yawn* Anyhow you'd serialize it in some format. Maybe even text ;)

>
> I had actually for a long time considered just using a database, but I
> can't control how the database caches things in RAM.  I want the cache
> to have a long term usage memory as well as recent use, and doing that
> with a database is just too complex.  Yes, this is sort of like a
> database, but except for some databases that store everything in RAM,
> it's not the same.

There is this moment through out your post - you seem to require some 
implementation details but have no high-level goals specified. It's kind 
of backwards.

> OTOH, if this were the entire program, you would be right that just
> using a hash table was a better idea.  But this is actually almost a
> side show, rather like GUI's are sideshows to what the program is
> actually doing.  This is what the program needs to interface
> intelligibly with people.  As such, it's not a heavily used part, and
> it's best if it doesn't consume excess active RAM.  (For that matter, I
> *know* my computer is too small for what I'm trying to do.  So using
> extra RAM is undesirable. I may eventually decide to have the thing that
> I'm currently calling "the array" live out on disk, but it's my
> understanding that this isn't much of a gain over having it just rolled
> out by virtual memory.)

>  I've also seriously considered using a sort of
> database (I'm thinking of KyotoCabinet), but doing it this way yields
> fewer external dependencies, and the hash cache is probably a better
> choice.  (Again, with KyotoCabinet I don't have as much control over
> what it decides to keep in the cache.)

> I expect that the total memory
> needs of the program will edge towards the Terabyte range, so most of it
> is going to need to live on disk no matter what I do.

Quoting you before that:

 > I'm contemplating an associative array that will eventually grow to be
 > an estimated 64KB in size, assuming it's about half full.  It would 
then be holding around 90,000 entries.

It's anyway sooo tiny compared to the rest you'd better focus on getting 
the main program efficient. Even saving say save all of that RAM (assume 
you cache weights 0 bytes) you'd save 64Kb of RAM. Now compared with at 
least 1G of ram (or what have you to work with said 1Tb?) it's 64K/1G or 
0.006% of RAM gained.

Anyhow I'd spent that time learning how memory mapped I/O works instead 
of sweating over that 0.006% (that are absolute maximum you'd never 
achieve btw). All in all the fruit to optimize is where the potential 
gain is large, something that is proportional to your dataset.

> But I don't know
> in detail just what's going to be involved there.  I've roughed out a
> data file based around fixed length records addressable by id# which can
> be transformed into a file offset, but how to decide what should be
> removed from RAM isn't something I've figured out yet.  You could think
> of the basic datastructure of the program as a 40 dimensional cyclic
> graph with one-way links, but that's not quite right.  The links carry
> more information than is typical of graph structures (though logically
> that's no change...you could model them as nodes along the links that
> act as filters). OTOH, the file structure that supports this is quite
> simple.  Just a direct access file with fixed length blocks.

> As for "to be added" being separate, that's not hard either.  That could
> be either a hash table or an array. though a hash table would probably
> be simpler.  And it's just another place I'd need to check when looking
> to see if I already recognized the code.  Still, that's something that
> should really be saved until later, as it can be added on at any time.
> (Which probably means I'll never get around to it.)

What I can say is it that there is never enough time to optimize every 
bit of system. Spend time wisely.

-- 
Dmitry Olshansky