Revamping associative arrays

Sat Oct 17 23:59:54 PDT 2009

Andrei Alexandrescu:

I like how you leave no stone unturned. Regardless of what the "final" quality of the D2 language will be, I will always think you have done your best to improve it. I will say so to the people I know too.

I have already discussed about D AAs in the past, but I guess it was not the right moment to do.

Do we want to keep them in the language? It's easy to do an:
import std.collections: HashMap;
And a flexible language is supposed to allow for a handy syntax even for user-defined collections.
But the built-ins have a nicer syntax and a little different purpose. So I think having the built-in syntax is good.

That syntax may map to built-in logic, or to library code. Both options are acceptable, but I think having built-in logic is a little better.
But I need templated HashMaps and TreeMaps too in a std.collections.

Using templates leads to faster code, so the hash in collections can be templated. But templates slow down compilation and inflate binary size. Built-i AAs have a handy syntax, so they get used more often, even where you put only 6 items into them. So I'd like built-in AAs to be optimized for a very small number of items too. There are several ways to do this. Built-in AAs must be the most general and very flexible, easy to use, and not bug-prone. I don't need built-in AAs to be the faster ones, or the ones the use the less memory. This is an optimization, so for such more demanding needs I can use the HashMap from the collections module.

So a good thing of built-in AAs is that they don't inflate code a lot, and the compilation is fast, so they can be used everywhere in the program performance in both speed and memory isn't critical (and this happens).

I like how the current AA design never degenerate to an O(n^2) behaviour (Python dicts are much faster than the current D AAs but Python dicts in corner cases go quadratic). But this quality has the disadvantage of requiring a opCmp too to items that I want to put in an AA (because final chaining is resolved with an ordered tree), while in Python and other languages I need just an opEq and an opHash.

Regarding the iteration, I'd like the design used in Python3, where keys and values (and items, that are pairs, if you want) return a small iterable object that's a view to the dict. I may also want methods that return a true array of values/items/keys as now, because once in a while you need this too (Python3 doesn't have this, and you need to do list(d.keys())).
The keys() returns an light immutable set object, this is very good and very handy. Doing set operations is a godsend in certain kinds of code.

To delete an key-value another possible syntax is:
delete d[key]
But d.delete(key) too is acceptable.

LDC compiler is able to optimize away the double lookups if they are done nearby, so it's even possible for the "in" to return a boolean. (But I like to have both !in and !is too).

I don't use .rehash often, and it takes some time to run. So I don't know how much useful it is.

I need a .clear method too. And maybe a .copy too. AAs *must* support opEqual too, as in LDC, because I have to know when a function returns the correct AA inside unittests.
(Python dicts even support opCmp, but this is a little tricky, and less useful, so this can be omitted).
This features don't require 100 lines of code, but they make AAs quite more useful and handy.

A method like Python .get(key[, default]) is quite useful, it returns default if the key is missing. In D probably the default can't be optional.

fromkeys, update (and maybe pop and popitem too) methods can be useful, see Python docs: 
http://docs.python.org/library/stdtypes.html#dict
.update can also be written as ~, but there it works only with another AA, and it's not commutative.

I'd like to have a literal for an empty AA, I use AA!(K,V) in my dlibs.

An old idea: If the value of an AA is of type void, the AA may not allocate memory for them, so the AA becomes a set (and it needs few basic set operations, with operator overload).

I may like an AA to be false when empty.

I'd like to have a standard way to, given a pointer to a AA value, return its key. I have a function that does this in my dlibs (module "extra") but it's not officially, so it will break when the AA implementation will change. This is useful if I want to implement a richer data structure on top of built-in AAs, for example an "ordered AA" that when iterated on, yields items in the same order as the insertion order, so values are kept linked in a double linked list, to do this and keep the data structure flexible I need to know a key given a pointer to a value.

I may like a "freeze" method that turns an AA into an immutable one. Ruby has this.

A .reserve method may be useful to speed up AA creation when you know you have to add tons of pairs.

I'd really like the "default" iteration on an AA to yield its keys, instead of values as currently done. Because if I have a key I can find its value, while the opposite is not possible, so having keys is much more useful. This is true in Python too. In my dlibs all iterables and functions behave like this. The current D design is just a design mistake ad Walter was wrong on this.

Reducing the number of small memory allocations done by the AAs currently used may be positive. Making them a little more precise for the GC will be very good as D matures.

The management of AAs at compile-time can be better. Having literals that work at compile time, or compile-time functions used to create runtime AAs is positive. It can even be possible to use perfect hashes for immutable AAs built at compile time.

Bye,
bearophile