hashed array?

Mon Nov 12 09:44:30 PST 2012

On Mon, Nov 12, 2012 at 05:52:32PM +0100, Joseph Rushton Wakeling wrote:
> On 11/12/2012 05:27 PM, Dmitry Olshansky wrote:
> >>The current "workaround" is to just dummy it.
> >>alias bool Dummy;
> >>Dummy[string] names;
> >>names["Paul"] = true; //Ew.
> >
> >Make a wrapper?
> 
> The problem with wrapped versions of associative arrays is that they
> just don't scale with number of keys.  If I want to store (say) a
> set of 3 million size_t's, then it costs a lot more than 3_000_000 *
> size_t.sizeof() to do so this way.
>
> Do dedicated set implementations do better than this?

Probably.

For storing integer types, you could optimize for space by dividing your
key space into blocks of, say, 32 entries each. Then your hash function
will hash only the upper (n-5) bits of the key, and each hash slot will
be a bit array of 32 bits, which are indexed by the lower 5 bits of the
key. Assuming your keys are relatively densely populated, this will save
quite a significant amount of space. In the sparse case, you probably
don't waste that much space either, since each slot is only one 32-bit
int wide.

You can probably also increase the size of the slots to, say, 64 bits or
more, if you want to maximize the usage to GC allocation block overhead
ratio. Yes, each slot is only 1 int wide, but the GC does have to
maintain bookkeeping information that's probably a lot more than that.
Allocating a large number of very small blocks is generally worse than a
smaller number of medium-sized blocks.

If your keys are *very* dense, then storing ranges instead of individual
elements is the way to go (though the implementation will be
significantly more complex).

For storing sets of strings, though, you'd want something like a trie
instead.

It all depends on your specific application. So a per application
implementation might not be out of place.

> I should say that my own interests in an implementation of set are
> twofold -- first, efficient storage; and second, being able to do an
> efficient foreach across the stored values, without any concern for
> order.

With the hash + bit-array approach above, foreach would be very simple
(just walk the hash table slots and enumerate the bits in each slot).

> >Or rather call them sets and have them in the library.
> 
> The whole builtin-vs.-library thing seems a big distraction -- if
> something needs to be implemented, the library seems the natural
> place unless there's a very good reason otherwise.

Yeah, there's definitely an advantage to minimizing the built-in stuff
and keeping them in the library. Keeps the language relatively clean and
simple, but still provide convenient library types for common tasks.

T

-- 
Be in denial for long enough, and one day you'll deny yourself of things you wish you hadn't.