About built-in AAs

Wed Aug 17 10:09:11 PDT 2011

On Wednesday, August 17, 2011 08:59 Steven Schveighoffer wrote:
> On Wed, 17 Aug 2011 11:47:59 -0400, Jonathan M Davis <jmdavisProg at gmx.com>
> 
> wrote:
> > On Wednesday, August 17, 2011 10:04:21 Steven Schveighoffer wrote:
> >> On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu
> >> 
> >> <SeeWebsiteForEmail at erdani.org> wrote:
> >> > On 8/16/11 9:29 PM, bearophile wrote:
> >> >> Walter Bright:
> >> >>>> I think there are search trees like the Red-Black ones that
> >> >>>> guarantee
> >> >>>> a O(n ln n) worst case. I am wrong?
> >> >>> 
> >> >>> Just feed it more data.
> >> >> 
> >> >> If you feed it more data, even if all items pruce collision because
> >> >> they all hash to the same bucket, if you use Red-Black trees to
> >> 
> >> handle
> >> 
> >> >> the items in the same bucket you keep having a O(n ln n) behaviour,
> >> >> that's usually fast enough. With Python and the new D AAs you instead
> >> >> get a quadratic one. This quadratic behaviour gives troubles way
> >> >> before
> >> >> the physical RAM is exhausted.
> >> >> 
> >> >> Bye,
> >> >> bearophile
> >> > 
> >> > Let's please stop this. Many of us, including yourself, noticed the
> >> > relatively poor performance of D's previous hashtables compared to
> >> 
> >> other
> >> 
> >> > languages. Switching to singly-list collision handling marked an
> >> > improvement. Now a lot of data structure designs have a worst-case
> >> 
> >> that
> >> 
> >> > makes them perform worse than others. If you worry about attacks,
> >> 
> >> please
> >> 
> >> > implement your own hashtable. If we switch back to the old
> >> > implementation, you'll complain again about D's hashtables being
> >> 
> >> slower
> >> 
> >> > than Python's, thus closing a years-long cycle.
> >> 
> >> Yes, but let's not forget the one valid request out of all of this -- if
> >> trees are no longer being used, opEquals should be used insted of opCmp.
> >> This allows more possible key types (which don't define an ordering). I
> >> think this would be a simple druntime change.
> > 
> > But then we can't change the hash table type to one that needs opCmp if
> > we
> > need to later. That might be acceptable, but it makes it so that we can't
> > transparently change the implementation again if we decide that we need
> > to.
> 
> I think that's a choice we should embrace. AFAIK, no *builtin* hash
> implementations use trees for buckets in any language I'm aware of (I'm
> sure someone will find one though :). The precedent is to require opHash
> and opEquals, not opCmp. It just makes more sense for builtin hash tables
> to allow the most possible key types it can.
> 
> Also, currently, if opCmp doesn't exist the *COMPILER MAKES ONE UP*, which
> is totally unacceptable.
> 
> So if you define opEquals and not opCmp, as bearophile points out, your
> specifically defined opEquals is not even used, and some made-up
> approximation is used instead!
> 
> It's one thing to make up opEquals, that is pretty easy to get reasonably
> right. It's something entirely different to invent an opCmp, especially
> for types which have no ordering!

I'm not necessarily arguing that we should keep requiring opCmp (and certainly 
having the compiler generate one is not good IMHO). I'm just pointing out that 
that would mean that we'd be putting further limitations on our ability to 
change the implementation later if we decide that we need to. As long as we 
think that the benefits outweight the costs, then removing the need for opCmp 
and AA's probably is what we should do. But regardless, having the compiler 
create an opCmp for you seems like highly broken behavior.

- Jonathan M Davis