Absolutely horrible default string hashing

dsimcha dsimcha at yahoo.com
Sun May 3 11:27:07 PDT 2009


== Quote from Steve Teale (steve.teale at britseyeview.com)'s article
> Kristian Kilpi Wrote:
> > On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger at free.fr>
> > wrote:
> > > |
> > > | foreach(c; str)
> > > | {
> > > |     ret = (ret << 4) + c;
> > > | }
> > > |
> > > 	That one is very bad because it only takes into account the last
> > > few characters of the string (how many depends on the size of the
> > > hash). However, several hashing algorithms use 2^n+1 as their
> > > multiplier, which is very fast even on old/small/embedded hardware
> > > because it can be implemented as a shift and an addition.
> >
> > You are of course right; thanks for the correction. Let's scrap that
> > algorithm! :)
> Would something like
> foreach(i, c; str)
> {
>    ret ^= cast(uint) c;
>    if (i & 1)
>       ret <<= 1;
> }
> do a better job? That could cover the first 64 chars reasonably well, and I
> suspect that could cover a lot of use cases.
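
To make the quoted exchange concrete, here is a minimal sketch of the
shift-and-add scheme Jérôme describes, using 33 = 2^5 + 1 as the multiplier
(the constant and the function name are illustrative choices, not what any
particular D runtime uses):

import std.stdio;

// Multiplicative string hash with a 2^n+1 multiplier (here 33 = 2^5 + 1).
// The multiply is just a shift and an add: ret * 33 == (ret << 5) + ret.
size_t shiftAddHash(string str)
{
    size_t ret = 0;
    foreach (c; str)
        ret = (ret << 5) + ret + c;
    return ret;
}

void main()
{
    writeln(shiftAddHash("hello"));
    writeln(shiftAddHash("world"));
}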

I guess a lot of people missed my follow-up post on this.  This is largely a
typeinfo bug, not a string hashing bug.  typeid(string) returns a reference to
the generic array TypeInfo, whereas typeid(char[]) returns a TypeInfo for which
the hashing works well.  What is needed is for typeid(typemodifier(char)[]) to
return the same thing as typeid(char[]).
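
For anyone who wants to see the difference described above, a rough way to
poke at it is to compare the TypeInfo (and the hashes it computes) for string
and for char[]; the exact output depends on the compiler/runtime version, so
treat this as an observation sketch rather than a test case:

import std.stdio;

void main()
{
    string s = "hello, world";
    char[] c = s.dup;

    // If typeid(string) falls back to the generic array TypeInfo while
    // typeid(char[]) has a specialized one, the two hashes below are
    // computed by different routines of different quality.
    writeln(typeid(string));               // e.g. immutable(char)[]
    writeln(typeid(char[]));               // char[]
    writeln(typeid(string).getHash(&s));
    writeln(typeid(char[]).getHash(&c));
}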

Then again, I guess that, given how bad the generic array hashing is, it wouldn't
hurt to improve that, too.  If someone implements a decent meta-hashing
function (i.e., one that takes the hashes of the elements of some object, such
as an array, a tuple, or a struct, and produces a single good hash out of
them), there would be no shortage of use cases for it.
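
As a sketch of what such a meta-hashing function could look like, the
combining step below is the widely used boost::hash_combine mixer applied to
per-element hashes obtained from TypeInfo.getHash; the names and the constant
are illustrative, not anything the D runtime currently provides:

import std.stdio;

// Mix one element hash into an accumulated seed (boost::hash_combine recipe).
size_t hashCombine(size_t seed, size_t h)
{
    return seed ^ (h + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// Meta-hash: fold the element hashes of an array into a single hash.
size_t hashArray(T)(const(T)[] arr)
{
    size_t seed = 0;
    foreach (e; arr)
        seed = hashCombine(seed, typeid(T).getHash(&e));
    return seed;
}

void main()
{
    writeln(hashArray([1, 2, 3]));
    writeln(hashArray(["foo", "bar"]));
}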


