hash design problem: both string and int[] keys

Tue Nov 20 12:42:50 PST 2012

Ali Çehreli wrote:
> On 11/20/2012 11:39 AM, Charles Hixson wrote:
>
>  > I'm trying to figure out how to construct an associative array whose
>  > keys will be a combination of strings and immutable int[]'s, but every
>  > approach I've looked at has run into problems.
>
> Can you show some code?
Sorry, I'm still thinking about how to design it.  BBFMI has definite 
limitations, and that's all I could use to write code right now.
>
>  > It should be relatively
>  > easy, as strings are really just an immutable list of ints,
>
> A list of dchars is more accurate of course, but yes, dchar can be
> casted to int.
Well, longs anyway.  Which is a minor problem, as I'd prefer to use 
ints, which are smaller.
>
>  > but I
>  > haven't been able to find how strings are hashed. (It's looking as if
>  > the details are handled in C...which makes it difficult.)
>
> I bet the algorithm considers all of the characters of the strings. The
> following has been mentioned in a recent forum thread:
>
> http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
For unicode strings?  This causes a problem, though "considers" gives a 
lot of room for alternatives.  For ASCII there are several plausible 
approaches, since all one needs to do is minimize collisions.  But there 
are too many plausible unicode chars for most of them to be reasonable. 
  I'd probably go for multiply by a medium size prime, add to the 
accumulator, mod the accumulator by a large prime, stepping through the 
loop.  And I'd initialize the accumulator by the length of the current 
string.  (Note that this would apply equally to strings or int[]'s.)  So 
if that's what you mean by "considers all the characters", then, yes.
>
>  > I could define
>  > my own hash code, but I don't find it at all clear what would be
>  > appropriate given that I don't know the size.
The above was a pseudo-code of the hash function I was considering.
>
> I wonder how much slow iterating over string would be, compared to
> directly accessing the elements of a dstring. (After all, strings must
> be decoded to produce dchars.)
>
> In any case, you don't need the length of the string; a loop would do:
The length isn't a problem.  It just further differentiates the 
different strings, and it's a convenient built-in function.
>
> while (!s.empty) {
> // ... use s.front and s.popFront() ...
> }
>
> Alternatively:
>
> foreach (c; stride(s, 1)) {
> // ... use c ...
> }
>
> UFCS looks better: :)
>
> foreach (c; s.stride(1)) {
> // ... use c ...
> }
Those are all good, but I might just use:
foreach (char c; lst)  {  ...  }
A hash code doesn't need to reflect the chars of the string, merely to 
separate different cases.
>
>  > The data is rather sparse,
>  > so a hash table seems appropriate. (My only other alternative is a
>  > database, and that imposes the heavy slowdown of I/O ... even though it
>  > does automate persistence it doesn't seem like a good tradeoff.)
>  >
>  > So far my best idea is to build two tables, and then look at the key
>  > when a retrieval is attempted to figure out which table it's in. That
>  > would probably work, but it feels like a kludge.
>
> How do you differentiate between the key type being int[] vs. string?
>
> Ali
>
I'd define the same access method (opIndex, opIndexAssign) with two 
different parameter types.

But as I said, that feels like a kludge.  I'm just contemplating it 
because I want rehash to work properly.  So I'd have two indexes:
Items[string]  stringKeys;
Items[int]    listKeys;
This lets the compiler handle resizing the arrays properly, at the cost 
of doubly hashing the listKeys variable.  This isn't too bad, as I 
expect all the lists to be 5 items or shorter, but it feels like a kludge.