[OT] Efficient file structure for very large lookup tables?

H. S. Teoh hsteoh at quickfur.ath.cx
Tue Dec 17 17:53:00 PST 2013


On Tue, Dec 17, 2013 at 05:09:43PM -0800, Brad Roberts wrote:
> If you can afford to do a one-time sort of both sets of data, then
> you can do a very simple single-pass intersect through both sets.
> Chances are that won't work. So, a potential alternative: use that
> approach combined with a smaller delta since the last sort, hopefully
> small enough to fit in memory in a nice, convenient-to-probe
> structure. Periodically, as the delta grows too big, merge it into
> the primary data file.
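
If I understand you right, the single-pass intersect would be
something like this (a minimal D sketch with hypothetical names,
assuming both sets are sorted arrays of ulong keys in memory):

import std.array : appender;

// Walk both sorted key arrays in lockstep, emitting common keys.
ulong[] intersectSorted(const(ulong)[] a, const(ulong)[] b)
{
    auto result = appender!(ulong[]);
    size_t i, j;
    while (i < a.length && j < b.length)
    {
        if (a[i] < b[j]) ++i;                // advance the smaller side
        else if (a[i] > b[j]) ++j;
        else { result.put(a[i]); ++i; ++j; } // key present in both sets
    }
    return result.data;
}

with the delta being a small sorted structure probed the same way and
merged into the main file whenever it outgrows memory.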

The problem is that the data is generated by the algorithm as it runs.
If it were known beforehand, the problem would be much simpler. :)
(Well, sort of... the reason it's generated on-the-fly is that the
full data set is far too big to store anywhere -- the algorithm only
generates the subset that it needs in order to run, and that's the
subset that has to be stored in the table.)


> Try looking at the access patterns to see if there's something you
> can do to adjust them to make the storage problem simpler.
[...]

I'm currently researching alternative ways of partitioning n-space
that may allow high hit-rate caching of disk pages. When n is large,
partitioning into n-cubes gives very poor cache hit rates, due to the
(counterintuitive) fact that in high dimensions most of an n-cube's
volume lies near its vertices, yet the region the algorithm is most
likely to look up next lies near the coordinate axes. Since the
n-cube's volume grows exponentially with n, and most of that volume
lies away from the coordinate axes, almost all of a cached page will
never be used again, which makes for very poor cache performance.
(Not to mention that buckets shaped like n-cubes hold O(2^n) entries,
which means they will probably span many disk pages, and the program
will have to scan through them all to find the entry it wants.)
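
To put rough numbers on it (my own back-of-the-envelope arithmetic,
not measured data): at unit radius, an n-cube bucket covers 3^n
integer points, while an n-cross bucket covers only 2n+1:

import std.stdio : writefln;

void main()
{
    ulong cube = 1;
    foreach (n; 1 .. 21)
    {
        cube *= 3;                  // (2*1+1)^n points with max |x_i| <= 1
        ulong cross = 2UL * n + 1;  // points with |x_1| + ... + |x_n| <= 1
        writefln("n=%2d  cube bucket: %10d points  cross bucket: %3d points",
                 n, cube, cross);
    }
}

By n = 20 the cube bucket covers about 3.5 billion lattice points
while the cross bucket covers 41 -- which is the gap the next
paragraph is about.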

If there were a way to partition n-space such that each cached page
corresponds to the shape of an n-cross (an n-dimensional octahedron),
or some other shape whose size grows only linearly with n, then the
bucket size would be only O(n) rather than O(2^n). It would also
exhibit better cache performance, since the entries in the bucket
would lie in the region the algorithm is most likely to look up next.
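
The kind of bucket key I have in mind would be something like the
following (purely hypothetical, just to illustrate the idea -- an
n-cross may not tile n-space exactly, so a real scheme would need some
approximation). It groups entries into thin L1 shells, so entries at a
similar L1 distance from the origin share a disk page:

import std.algorithm.iteration : map, sum;
import std.math : abs;

// Bucket key: which L1 shell a point falls into. Points with similar
// L1 norm -- i.e. similarly close to the coordinate axes' origin --
// get the same key and hence land on the same disk page.
size_t l1Shell(const(int)[] point, size_t shellWidth)
{
    auto norm = point.map!(c => cast(size_t) abs(c)).sum;
    return norm / shellWidth;   // shell index = page/bucket key
}

unittest
{
    // |3| + |-1| + |0| + |2| = 6, so with width 4 this lands in shell 1.
    assert(l1Shell([3, -1, 0, 2], 4) == 1);
}

The low-numbered shells -- where the lookups cluster -- would then
stay hot in the cache.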


T

-- 
It always amuses me that Windows has a Safe Mode during bootup. Does that mean that Windows is normally unsafe?

