Associative Arrays need cleanout method or property to help
Michael Rynn
michaelrynn at optusnet.com.au
Tue Mar 23 15:10:37 PDT 2010
On Tue, 23 Mar 2010 13:41:56 -0700, Walter Bright wrote:
> Michael Rynn wrote:
>> On Mon, 22 Mar 2010 00:52:24 -0700, Walter Bright wrote:
>>
>>> bearophile wrote:
>>>> A way to know the usage patterns of D AAs is to add instrumentation
>>>> in nonrelease mode, they can save few statistical data once in a
>>>> while in the install directory of D, and then the user can give this
>>>> little txt file to Walter to tune the language :-)
>>> There's no need to tune the language. The implementation and data
>>> structure of the AAs is completely opaque to the language. The
>>> implementation is in aaA.d. Feel free to try different implementations
>>> in it!
>>
>> Well, I have.
>>
>> See the code here : http://www.dsource.org/projects/aa/browser/trunk/
>> druntime/aaA.d
>>
>> See the benchmarks and comments here :
>> http://www.dsource.org/projects/aa/wiki/DrunTimeAA.
>>
>> The result squeezes some extra performance for integer or less sized
>> keys (about 20% faster for lookups).
>>
>> Insertions and cleanups are faster too.
>
> It's a nice implementation. There are some problems with it, though:
>
> 1. Deleted entries cannot be returned to the garbage collector pool
> unless the entire hash table is removed.
>
> 2. All the keys are stored in a single array, as are all the values.
> This requires that a contiguous chunk of memory can be found for the
> arrays when rehashing to a larger hashmap. This could make it very
> difficult for this to work when hash tables grow to be of significant
> size relative to all available memory. Note that the current
> implementation has a bucket array that has as many entries as the number
> of elements in the hashtable, but the element size is only a pointer,
> not the arbitrary size of a key or value.
>
The file aaA.d is not using the randomAA implementation. It's your very
own original AA with modifications (added mag wheels, turbo-charged
memory, specializations).
So assertion 2 is not quite true, as this is still the same underlying
algorithm as the original AA. I did benchmark the random AA and tried to
give it a fair chance out of politeness, for comparison interest, to
demonstrate that its performance was not as good as the others', and to
demonstrate that the builtin AA had further potential. I also compared
the Tango hashmap, which is very good.
Your hashtree approach allocates a separate node for each key-value
association. So the keys and values are not all stored in a single array,
and do not actually require a contiguous large chunk of memory. There is,
however, an intermediate heap between the AA and the garbage collector,
which allocates 4K blocks of memory and, obviously, is unlikely to free
any of those blocks until all nodes within are released from use, unless
the insertion and deletion patterns are especially favourable.
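To make that concrete, here is a rough sketch (not the aaA.d code itself;
the names, the layout and the 4K chunk figure are only illustrative) of
one-node-per-association backed by a chunked allocator. A chunk can only
go back to the GC once every node carved out of it is dead, which is the
behaviour described above:

struct Node(K, V)
{
    Node* next;   // bucket chain link
    size_t hash;  // cached hash of the key
    K key;
    V value;
}

// Hypothetical chunker: hands out nodes from fixed-size blocks, so no
// single large contiguous array of keys or values is ever needed.
struct NodeChunker(K, V)
{
    enum chunkBytes = 4096;
    enum perChunk = chunkBytes / Node!(K, V).sizeof;

    Node!(K, V)[] current;
    size_t used;

    Node!(K, V)* allocate()
    {
        if (current.length == 0 || used == current.length)
        {
            current = new Node!(K, V)[perChunk]; // fresh ~4K block
            used = 0;
        }
        return &current[used++];
    }
}

void main()
{
    NodeChunker!(int, string) chunker;
    auto n = chunker.allocate();
    n.key = 42;
    n.value = "answer";
    assert(n.key == 42 && n.value == "answer");
}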
Now the memory chunker could be chucked out, leaving just the
specialization for integer-sized keys, and that would settle assertion 1.
This is a change of only a few lines of code, and could actually be
versioned.
What would also be lost is some degree of the improved insertion and
cleanup performance, and some degree of that benchmark performance, which
cheats, by the way, by looking up keys in the very same order that they
were inserted. Nodes stored in insertion order are likely to be grouped
together in memory, and so benefit from the CPU cache. I have not yet got
around to changing it to a thoroughly stirred and mixed-up sequence for
the lookups, which should put the figures up a bit.
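The gist of that benchmark concern, as a small sketch using the builtin
AA (the sizes and the harness here are illustrative only, not the actual
benchmark code): looking keys up in insertion order walks nodes in
roughly allocation order and flatters the cache; shuffling the key order
first removes that advantage.

import std.random : randomShuffle;
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writefln;

void main()
{
    enum n = 1_000_000;
    int[int] aa;
    int[] keys;
    foreach (i; 0 .. n)
    {
        aa[i] = i;
        keys ~= i;
    }

    // Pass 1: lookups in insertion order (cache-friendly).
    auto sw = StopWatch(AutoStart.yes);
    long sum;
    foreach (k; keys) sum += aa[k];
    writefln("insertion order: %s msecs", sw.peek.total!"msecs");

    // Pass 2: lookups in shuffled order (closer to real-world access).
    randomShuffle(keys);
    sw.reset();
    foreach (k; keys) sum += aa[k];
    writefln("shuffled order:  %s msecs", sw.peek.total!"msecs");
}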
> 3. Rehashing (required as the hashmap grows) means that the key/value
> arrays must be reallocated and copied. That's fine for the keys, but for
> the values this is a huge problem because the user may have dangling
> pointers into the values. (This is why Andrei started the thread
> entitled "An important potential change to the language: transitory
> ref".) I think this is currently the killer problem for randAA's
> approach.
I do not think this is true of the current builtin AA, and therefore it is
not true of this enhanced implementation. I note that the current builtin
_aaGet will locate a new node by pointer, maybe do a rehash, and then
return the node it previously located. Relax, this AA has no fear of nodes
moving around on rehash. They are just relinked. All done with pointers.
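In sketch form (again illustrative names, not the druntime code), rehashing
a node-based table means allocating a bigger bucket array and re-chaining
the existing nodes; the nodes themselves never move, so a pointer into a
value stays valid across the rehash:

struct Node(K, V)
{
    Node* next;
    size_t hash;
    K key;
    V value;
}

void rehash(K, V)(ref Node!(K, V)*[] buckets, size_t newLength)
{
    auto newBuckets = new Node!(K, V)*[newLength];
    foreach (head; buckets)
    {
        while (head !is null)
        {
            auto node = head;
            head = head.next;               // detach from the old chain
            auto i = node.hash % newLength; // new bucket index
            node.next = newBuckets[i];      // relink, never copy
            newBuckets[i] = node;
        }
    }
    buckets = newBuckets;
}

void main()
{
    alias N = Node!(int, string);
    auto buckets = new N*[4];
    auto n = new N(null, 7, 7, "seven");
    buckets[7 % 4] = n;
    auto before = &n.value;           // a pointer into the value
    rehash(buckets, 16);
    assert(buckets[7 % 16] is n);     // same node, just re-chained
    assert(before is &n.value);       // the value never moved
}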
---
Michael Rynn