Are Gigantic Associative Arrays Now Possible?

dlangPupil via Digitalmars-d digitalmars-d at puremagic.com
Fri Mar 24 09:59:00 PDT 2017


On Friday, 24 March 2017 at 06:30:25 UTC, H. S. Teoh wrote:

> You have to keep in mind that one downside of hashtables is 
> that they tend to be unfriendly towards caching hierarchies
...
>
> For certain applications, where key lookups are more-or-less 
> random, this is the best you could do, but for a good number of 
> applications, there tends to be some correlation between 
> lookups, and even things like B-trees could potentially do 
> better because of their cache-friendliness (and locality), even 
> if you assume I/O is much faster than your traditional 
> spindle-based hard drive. The O(log n) could have a much 
> smaller constant than the hashtable O(1) if lookups tend to be 
> correlated (i.e., exhibit locality), and lots more cache misses 
> are incurred in the latter.
>
> Having said that, though, large-capacity low-latency SSDs are 
> very interesting to me because I'm interested in certain 
> applications that involve very large lookup tables.  Currently 
> I'm just using traditional hashtables, but once you get to a 
> certain size, I/O overhead dominates and progress grinds to a 
> halt.  I've been researching ways of taking advantage of 
> locality to ease this, but so far haven't gotten it to actually 
> work yet.
>
>
> T

Thanks, T, for the great insight.  Very helpful!  Nice to see that 
you, Laeeth, the ACM, Intel, and Micron all agree: this new 
technology could be very disruptive.

In addition to the prevalence of random lookups, certain other 
application conditions and designs might make "gigantic AAs" more 
useful or competitive with the alternatives:

1.  Use by programmers who need a data structure that's easier to 
reason about, faster and safer to implement or prototype, and 
easier for future users to read and maintain (see the first sketch 
after this list).

2.  Apps with large tables that use (or could use) non-natural 
keys, e.g., crypto-quality GUIDs.  Random keys gain nothing from 
the sorted order a B-tree search requires, so hashing them also 
avoids the cost of building and maintaining that order on natural 
key values (e.g., LastName="Smith").  (Second sketch below.)

3.  Tables (and related tables) whose row data (and even 
aggregate data) could be "clumped" (think "documentized") into a 
single unit (not merely some degree of locality), thus 
consolidating multiple lookups into a single lookup.  (Third 
sketch below.)

	-In other words, a key's value is itself an array of data that 
can be fetched in a single lookup.

	-The key id for each "next" (updated) version of a document 
could also be stored with the current version.

	-The next key-value entry to hold the update could be created 
immediately, but its values left unwritten unless and until the 
prior version of the document actually changes.

	-Better yet in an append-only design, creation of the next 
key-value entry could be deferred until a revision actually 
occurs.

	-Such "chained documentization" would automate histories, and 
could be further abstracted and adapted (e.g., with columnstore 
concepts) to accommodate apps in which lookups aren't mostly 
random.
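
To illustrate item 1, a minimal D sketch of how little ceremony 
the built-in AA needs; the word-count idiom is the classic 
example:

    import std.stdio : writeln;

    void main()
    {
        // Built-in AA: no setup, no sizing, no library to pull in.
        ulong[string] counts;

        foreach (word; ["apple", "pear", "apple"])
            counts[word]++;              // missing keys start at 0

        if (auto p = "apple" in counts)  // `in` yields pointer or null
            writeln("apple: ", *p);      // prints "apple: 2"
    }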
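
For item 2, a sketch using Phobos's std.uuid (the payload string 
is just a placeholder): the AA hashes the key bytes directly, so 
no sort order is ever computed or maintained on the keys.

    import std.stdio : writeln;
    import std.uuid : randomUUID;

    void main()
    {
        string[string] table;   // GUID string -> row payload

        // Crypto-quality GUIDs have no meaningful order to exploit,
        // so a B-tree's comparisons buy nothing here; the hash
        // doesn't care that the keys are random.
        auto id = randomUUID().toString;
        table[id] = "row payload for this key";

        writeln(table[id]);     // one hash, one probe
    }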
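
And for item 3, a sketch of the chained-documentization idea (the 
Document struct and the key names are illustrative assumptions, 
not an existing library): each value is a whole clump fetched in 
one lookup, plus the key of its next revision.

    import std.stdio : writeln;

    // Assumed layout: one AA value holds an entire document clump,
    // so a single lookup fetches everything, plus the key of the
    // next (newer) revision, empty until a revision actually occurs.
    struct Document
    {
        string[] rows;
        string   nextKey;
    }

    void main()
    {
        Document[string] store;

        store["doc-v1"] = Document(["row 1", "row 2", "row 3"]);

        // A revision appends a new entry and links the old one to
        // it (append-only: "doc-v2" didn't exist until this edit).
        store["doc-v2"] = Document(["row 1", "row 2 (edited)", "row 3"]);
        store["doc-v1"].nextKey = "doc-v2";

        // Walking the chain recovers the document's full history.
        for (string key = "doc-v1"; key.length; key = store[key].nextKey)
            writeln(key, ": ", store[key].rows.length, " rows");
    }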

Finally, re: caches: I haven't been able to find out whether a 
server's DRAM and its Optane SSDs or DIMMs can be combined into a 
single pool of RAM.  Mixing DIMMs of different speeds is a no-no; 
but if this could be done, then the hottest data could be "cached" 
in DRAM and thus spared the ~10x latency penalty of the SSD 
device.

