[OT] Efficient file structure for very large lookup tables?

Wed Dec 18 07:59:21 PST 2013

On Wed, Dec 18, 2013 at 03:03:20PM +0100, Craig Dillabaugh wrote:
> On Tuesday, 17 December 2013 at 20:50:23 UTC, H. S. Teoh wrote:
> >On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:
> 
> >>Another question is can your queries be batched?  If that is the
> >>case and your data is bigger than your available memory, then try
> >>Googling "Lars Arge Buffer Tree" which might work well.
> >
> >Hmm. I'm not sure if it's possible to batch the queries unless I
> >multithread the algorithm; the queries are generated on-the-fly.  But
> >they *are* somewhat predictable (spatial locality: see below), so
> >maybe that might help?
> >
> I should clarify a bit what I meant by batched.  If you have a
> sequence of queries q_1, q_2, .... q_n (where I guess in your
> case a query can be/is an insertion), then by 'batch' proccess
> what I mean is you do not need to get the answers on the fly. It
> is sufficient to have all queries answered when the program
> terminates?

Hmm. This is an interesting thought. In a sense, queries don't have to
be answered immediately; the program can generate a series of queries
and then continue processing the next item, the answer can be
asynchronous. When the answer becomes available, then further processing
will happen (add new items to be processed, update existing items,
etc.). The only requirement is that queries will always be answered in
the order they were made (the correctness of the algorithm depends on
this).

In a sense, what I need is really just a single operation
"add-or-update", which can be processed in bulk. Either an item is
already in the table, in which case it gets updated in a certain way, or
it isn't, in which case it is added. In both cases, some additional
processing takes place after the table is updated, which may create more
items to be processed.

> Note that these techniques still account for the 'order' of the
> queries, so the answer to query q_i will be reported correctly
> (ie. same as if the queries were performed sequentially on a
> standard data structure).

Then this is OK.

> Also, if you are still considering using hashing the following
> paper may be of interest to you (I haven't read it in detail, but
> it doesn't look too crazy)
> 
> http://arxiv.org/abs/1104.2799

Thanks for the references! I'll look into this, it seems quite relevant.

T

-- 
You have to expect the unexpected. -- RL