Are Gigantic Associative Arrays Now Possible?
dlangPupil via Digitalmars-d
digitalmars-d at puremagic.com
Fri Mar 24 09:59:00 PDT 2017
On Friday, 24 March 2017 at 06:30:25 UTC, H. S. Teoh wrote:
> You have to keep in mind that one downside of hashtables is
> that they tend to be unfriendly towards caching hierarchies
...
>
> For certain applications, where key lookups are more-or-less
> random, this is the best you could do, but for a good number of
> applications, there tends to be some correlation between
> lookups, and even things like B-trees could potentially do
> better because of their cache-friendliness (and locality), even
> if you assume I/O is much faster than your traditional
> spindle-based hard drive. The O(log n) could have a much
> smaller constant than the hashtable O(1) if lookups tend to be
> correlated (i.e., exhibit locality), and lots more cache misses
> are incurred in the latter.
>
> Having said that, though, large-capacity low-latency SSDs are
> very interesting to me because I'm interested in certain
> applications that involve very large lookup tables. Currently
> I'm just using traditional hashtables, but once you get to a
> certain size, I/O overhead dominates and progress grinds to a
> halt. I've been researching ways of taking advantage of
> locality to ease this, but so far haven't gotten it to actually
> work yet.
>
>
> T
Thanks, T, for the great insight. Very helpful! Nice to see that
you, Laeeth, the ACM, Intel, and Micron all agree: this new
technology could be very disruptive.
Beyond a prevalence of random lookups, certain other app
conditions and designs might make "gigantic AAs" more useful or
competitive with the alternatives:
1. Use by programmers who need a data structure that's easier to
reason about, faster and safer to implement or prototype, and
easier for future users to read and maintain.
2. Apps with large tables that use (or could use) non-natural
keys, e.g., crypto-quality GUIDs. Such keys can't benefit from
(and would only add the computational cost of) the sorting that a
B-tree must perform and maintain on natural key values (e.g.,
LastName="Smith").
3. Tables (and related tables) whose row data (and even
aggregate data) could be "clumped" (think "documentized") to
achieve full consolidation (not merely some degree of locality),
collapsing multiple lookups into a single lookup.
-In other words, a key's value is itself an array of data that
can be fetched in a single lookup.
-The key id of each "next" (updated) version of a document
could also be stored with the current version.
-The next key-value entry, which will hold the update, could be
created immediately but left unwritten unless and until the
prior version of the document changes.
-Better yet in an append-only design, creation of the next
key-value entry could be deferred until a revision actually
occurs.
-Such "chained documentization" would automate histories, and
could be further abstracted and adapted (e.g., with columnstore
concepts) to accommodate apps in which lookups aren't mostly
random.
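The "chained documentization" idea in point 3 can be sketched in a few
lines of D. This is only an illustrative toy, not a proposal: the AA here
is an ordinary in-memory one standing in for a gigantic SSD-backed table,
and the names (`DocVersion`, `insertDoc`, `reviseDoc`) are invented for
the example. It uses crypto-quality UUIDs as non-natural keys (point 2)
and stores each version's entire "clump" of row data, plus the key of its
successor, as a single value, so one lookup fetches everything. Per the
append-only variant, the next entry is created only when a revision
actually occurs.

```d
import std.uuid : randomUUID, UUID;

struct DocVersion
{
    string[] rows;   // "clumped" row data, fetched in a single lookup
    UUID nextKey;    // key of the successor version (UUID.init if none yet)
}

DocVersion[UUID] store;  // in-memory stand-in for the gigantic AA

// Create a brand-new document version under a fresh non-natural key.
UUID insertDoc(string[] rows)
{
    auto key = randomUUID();
    store[key] = DocVersion(rows, UUID.init);
    return key;
}

// Append-only revision: the next key-value entry is created only
// when a change occurs, and the old version records its successor,
// so walking nextKey links automates the document's history.
UUID reviseDoc(UUID current, string[] newRows)
{
    auto next = insertDoc(newRows);
    store[current].nextKey = next;
    return next;
}
```

Walking the `nextKey` chain from a document's first key then replays its
full revision history without any sorted index.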
Finally, re: caches: I haven't been able to find out whether it's
possible to combine a server's DRAM with its Optane SSDs or
DIMMs to form a single pool of RAM. Mixing DIMMs of different
speeds is a no-no; but if this could be done, then the hottest
data could be "cached" in DRAM and thus spared the ~10x latency
penalty of the SSD device.
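Even without hardware-level pooling, that hot/cold split can be managed
in software. A minimal sketch, assuming no pooled configuration exists:
a small DRAM-resident AA fronts the large slow-tier table, so repeated
lookups of hot keys avoid the ~10x SSD latency. `loadFromSlowTier` is a
hypothetical stand-in for the SSD-backed lookup; a real design would
bound the hot tier with an eviction policy (e.g., LRU) rather than the
naive size cap used here.

```d
string[string] dramCache;          // hot tier, kept in DRAM
enum size_t maxHot = 1_000_000;    // naive cap; use LRU eviction in practice

// Look up a key, consulting the DRAM tier first and falling back to
// the (hypothetical) SSD-resident table only on a miss.
string lookup(string key, string delegate(string) loadFromSlowTier)
{
    if (auto hit = key in dramCache)   // DRAM hit: no SSD latency paid
        return *hit;
    auto value = loadFromSlowTier(key); // slow tier: ~10x the latency
    if (dramCache.length < maxHot)      // admit while there's room
        dramCache[key] = value;
    return value;
}
```

The point is only that the latency asymmetry can be hidden for correlated
lookups; for truly random access patterns the hot tier buys little.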