hash design problem: both string and int[] keys

Charles Hixson charleshixsn at earthlink.net
Tue Nov 20 13:29:02 PST 2012


Charles Hixson wrote:
> Ali Çehreli wrote:
>> On 11/20/2012 11:39 AM, Charles Hixson wrote:
>>
>> > I'm trying to figure out how to construct an associative array whose
>> > keys will be a combination of strings and immutable int[]'s, but every
>> > approach I've looked at has run into problems.
>>
>> Can you show some code?
> Sorry, I'm still thinking about how to design it. BBFMI has definite
> limitations, and that's all I could use to write code right now.
>>
>> > It should be relatively
>> > easy, as strings are really just an immutable list of ints,
>>
>> A list of dchars is more accurate of course, but yes, dchar can be
>> casted to int.
> Well, longs anyway. Which is a minor problem, as I'd prefer to use ints,
> which are smaller.
>>
>> > but I
>> > haven't been able to find how strings are hashed. (It's looking as if
>> > the details are handled in C...which makes it difficult.)
>>
>> I bet the algorithm considers all of the characters of the strings. The
>> following has been mentioned in a recent forum thread:
>>
>> http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
> For unicode strings? This causes a problem, though "considers" gives a
> lot of room for alternatives. For ASCII there are several plausible
> approaches, since all one needs to do is minimize collisions. But there
> are too many plausible unicode chars for most of them to be reasonable.
> I'd probably go for multiply by a medium size prime, add to the
> accumulator, mod the accumulator by a large prime, stepping through the
> loop. And I'd initialize the accumulator by the length of the current
> string. (Note that this would apply equally to strings or int[]'s.) So
> if that's what you mean by "considers all the characters", then, yes.
>>
>> > I could define
>> > my own hash code, but I don't find it at all clear what would be
>> > appropriate given that I don't know the size.
> The above was a pseudo-code of the hash function I was considering.
>>
>> I wonder how much slow iterating over string would be, compared to
>> directly accessing the elements of a dstring. (After all, strings must
>> be decoded to produce dchars.)
>>
>> In any case, you don't need the length of the string; a loop would do:
> The length isn't a problem. It just further differentiates the different
> strings, and it's a convenient built-in function.
>>
>> while (!s.empty) {
>> // ... use s.front and s.popFront() ...
>> }
>>
>> Alternatively:
>>
>> foreach (c; stride(s, 1)) {
>> // ... use c ...
>> }
>>
>> UFCS looks better: :)
>>
>> foreach (c; s.stride(1)) {
>> // ... use c ...
>> }
> Those are all good, but I might just use:
> foreach (char c; lst) { ... }
> A hash code doesn't need to reflect the chars of the string, merely to
> separate different cases.
>>
>> > The data is rather sparse,
>> > so a hash table seems appropriate. (My only other alternative is a
>> > database, and that imposes the heavy slowdown of I/O ... even though it
>> > does automate persistence it doesn't seem like a good tradeoff.)
>> >
>> > So far my best idea is to build two tables, and then look at the key
>> > when a retrieval is attempted to figure out which table it's in. That
>> > would probably work, but it feels like a kludge.
>>
>> How do you differentiate between the key type being int[] vs. string?
>>
>> Ali
>>
> I'd define the same access method (opIndex, opIndexAssign) with two
> different parameter types.
>
> But as I said, that feels like a kludge. I'm just contemplating it
> because I want rehash to work properly. So I'd have two indexes:
> Items[string] stringKeys;
> Items[int] listKeys;
> This lets the compiler handle resizing the arrays properly, at the cost
> of doubly hashing the listKeys variable. This isn't too bad, as I expect
> all the lists to be 5 items or shorter, but it feels like a kludge.
After a bit more thought, I couldn't use a hash function to merge the 
ints of the list, because if I did then any collision would cause the 
old value to be replaces...or would it if they didn't compare to equal? 
  Which different items wouldn't, but the *keys* would compare to equal, 
so they probably would replace, which isn't at all what I desire.  But 
if I just try to combine the ints, I can't guarantee unique 
keys...certainly not for any list longer than 2 items, as int.max * int 
would require a long, but adding a third term is a "can't do".  BigInt 
defines opCmp, but doesn't seem to define toHash.  So that seems to mean 
I need to convert the list to a string.  UGH!!

On the one hand, it allows me to get away with only using one hashtable, 
but on the other I've got this ugly key conversion...but maybe I won't 
need to convert both ways.  But it's still ugly.  Even the two table 
approach was nicer.


More information about the Digitalmars-d-learn mailing list