Are Gigantic Associative Arrays Now Possible?

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Thu Mar 23 23:30:25 PDT 2017


On Fri, Mar 24, 2017 at 12:27:00AM +0000, dlangPupil via Digitalmars-d wrote:
> On Thursday, 23 March 2017 at 10:27:36 UTC, Ola Fosheim Grøstad wrote:
> > 
> > Increasing the size of a hash table would be prohibitively
> > expensive.  You need a data-structure that can grow gracefully.
> 
> Hi Ola,
> 
> Are the hash tables you refer to the ones that D uses in the
> background to implement associative arrays, or the ones that a
> programmer might create using AAs?
> 
> --If the former, then perhaps the AA's hash function could be tweaked
> for terabyte-range SSD "RAM".  But even if the typical 2:1 bucket/data
> storage ratio couldn't be improved, then creating 32 TB of buckets
> would still allow 16 TB of nearly-instantly-addressable data (on a
> 4-core Xeon w/ 48 TB Optane SSD).
> 
> --If the latter, then my goal is to design something that makes
> specific items randomly and instantly accessible, like a hash table,
> but with zero potential collisions.  Thanks!

How do SSD access speeds compare with L1 and L2 cache speeds?

You have to keep in mind that one downside of hashtables is that they
tend to be unfriendly towards caching hierarchies: the hash function is
generally designed to scatter keys as widely and as randomly as
possible (to minimize collisions), so potentially *every* hash lookup
is a cache miss, and perhaps more than one, depending on how collision
resolution is handled. That can be a big performance killer.
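
To make that concrete, here's a minimal D sketch (the bucket count is
made up purely for illustration, and this is not how druntime's AA
actually sizes or probes its table) showing how keys that are adjacent
from the program's point of view end up in widely scattered buckets:

    import std.conv : text;
    import std.stdio : writefln;

    void main()
    {
        // Hypothetical bucket count, purely for illustration.
        enum size_t nBuckets = 1 << 20;

        // Keys that are "adjacent" to the program still land in
        // widely scattered buckets, so a run of related lookups
        // touches unrelated cache lines (or SSD pages).
        foreach (i; 0 .. 8)
        {
            auto key = text("user:", i);
            writefln("%s -> bucket %s", key, hashOf(key) % nBuckets);
        }
    }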

For applications where key lookups really are more-or-less random, this
is the best you can do, but in a good number of applications there
tends to be some correlation between successive lookups, and even
things like B-trees could potentially do better because of their
cache-friendliness (and locality), even if you assume the I/O is much
faster than your traditional spindle-based hard drive. The O(log n)
could have a much smaller constant than the hashtable's O(1) if lookups
tend to be correlated (i.e., exhibit locality), because the hashtable
incurs far more cache misses.
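
As a rough back-of-envelope (with made-up but plausible numbers): with
10^9 keys and a B-tree fanout of around 256, a lookup costs about
log_256(10^9), i.e. roughly 4 node visits; if the top 3 levels stay in
RAM or cache, that's about one slow access per lookup, and correlated
lookups often hit leaf pages that are already cached. A hashtable probe
is O(1), but by design it lands on a more or less random bucket, so it
pays the slow access (plus whatever the collision chain costs) nearly
every time.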

Having said that, though, large-capacity low-latency SSDs are very
interesting to me, because some of the applications I care about
involve very large lookup tables.  Currently I'm just using traditional
hashtables, but once you get past a certain size, I/O overhead
dominates and progress grinds to a halt.  I've been researching ways of
taking advantage of locality to ease this, but so far haven't gotten it
to actually work.
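
One obvious direction (just a sketch of the general idea, not something
I've actually built; the names and sizes below are made up) is to keep
a small in-memory table in front of the big on-disk one, so correlated
lookups only pay the slow path once:

    // Sketch only: slowLookup stands in for whatever hits the
    // on-disk table.
    string[ulong] frontCache;           // small, in-memory
    enum size_t cacheLimit = 1_000_000;

    string lookup(ulong key, string delegate(ulong) slowLookup)
    {
        if (auto p = key in frontCache)
            return *p;                  // locality pays off here
        auto val = slowLookup(key);     // the expensive path
        if (frontCache.length < cacheLimit)
            frontCache[key] = val;      // naive: no eviction policy
        return val;
    }

The hard part, of course, is that this only helps if the workload
really does exhibit locality, and a naive cache like this says nothing
about eviction.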


T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG

