getHash inconsistency

Thu Mar 22 11:41:29 PDT 2012

On Mar 20, 2012, at 10:10 PM, H. S. Teoh wrote:

> On Wed, Mar 21, 2012 at 03:55:49PM +1100, Daniel Murphy wrote:
>> "H. S. Teoh" <hsteoh at quickfur.ath.cx> wrote in message 
>> news:mailman.917.1332250604.4860.digitalmars-d at puremagic.com...
> [...]
>>> Then the question is, what should be the fix?
>>> 
>>> Currently, char[] and string have getHash defined in
>>> rt.typeinfo.ti_Ad.TypeInfo_Aa, which appears to be a custom hashing
>>> algorithm, whereas const(char)[] uses TypeInfo_Array.getHash, which
>>> uses rt.util.hash.hashOf. Which one should be used?
> [...]
>> 
>> I assume the custom hashing routine is better/faster?  If so, that
>> one.  I'd assume that getHash not working the same for const strings
>> is an oversight. 
> [...]
> 
> Here's the current hashing code for char[] and string:
> 
>        foreach (char c; s)
>            hash = hash * 11 + c;
> 
> For const(char)[], it's rt.util.hash.hashOf, which is Paul Hsieh's
> SuperFastHash algorithm with very good hash distribution properties. It
> does seem to involve a lot more operations that the simple loop above,
> though; so I assume the above simple loop was chosen because hashing
> strings are a very common operation and, for the most part, only need a
> simple hash function.
> 
> So I'm kinda leaning towards SuperFastHash, but worried about whether it
> will cause performance degradation, in which case we should stick with
> the simple loop.

The thing with hashes is that you usually gain more by getting a good distribution than you lose by computing the hash.  Our AA implementation should probably store the computed hash value per node if it isn't already though, so we can compare against that before doing opEquals when looking up a node, and so it's available when rehashing.