Robustness of the built in associative array

Fri Mar 24 06:16:54 PST 2006

Hello,

The latest "Implicit fn template instantiation" thread contains code 
that contains a hidden and possibly hard to find bug. Rewriting the code 
with a non-templated version for clarity:

void increment(int[char[]] map, char[] key) {
	if (auto t = (key in map))
		(*t)++;
	else
		map[key] = 1;
}

To understand the code, one first has to know if the associative array 
is a value type or a reference type.

It turns out it is neither. It is definitely not a value type. The above 
function will in most cases alter the original associative array and 
work as expected. But it is not a reference type either -- in the few 
cases where adding a new key to the AA invokes a rehash, all hell breaks 
loose. The original associative array is suddenly *corrupted*! (The way 
to make the function work as expected is to make the map an inout 
argument, but that is beside my point.)

Internally, an AA is implemented as an aaA*[] -- a dynamic array of 
pointers to aaA-nodes. The first node of the array contains metadata 
(the total number of keys), and the remaining [1..$] nodes contains the 
hash buckets. When passing an aa to a function, the aaA*[] gets passed. 
I assume the reason for this implementations is efficiency. All you need 
to do an efficient lookup is the array of buckets. But is this 
implementation comes at a cost of robustness. To make use of AA's, you 
have to follow a special rule:

- Never add an element to or rehash the AA unless you know you hold the 
only reference.

This means that even though the AA behaves as a reference type, the data 
can not be shared unless all references are readonly. I feel I must see 
this bastardization of being neither value or reference type a design 
mistake.

A simple fix is to change the AA implementation so that AAs become MyAA* 
  instead of aaA*[], with MyAA defined as:

struct MyAA {
	AANode[] buckets;
	uint numberOfNodes;
}

You get a very slight additional lookup cost (double dereferencing to 
get to the buckets), but make the AA a pure reference type without any 
caveats and problems.

My questions are:

1) Why is the D built in array type designed in such a fragile way? 
Should D not be robust language? Do users have to resort to writing 
library wrappers around D's types to get robustness? If so:

2) Why even keep the builtin AA? With the addition of an opIn() 
overload, you could get *exactly* the same functionality from a library 
implementation:

Map!(char[],int) map;

rather than:

int[char[]] map;

Cons:
  - This slight syntax difference.

Pros:
  - The library version would be transparently replaceable by other 
implementations when the data or other conditions have different 
algorithmic requirements.
  - The library implementation could easily be more efficient than the 
current built in version by taking advantage of templates.
  - The library implementation could without problems be made to work 
with types that lack opCompare.
  - The library version would not have any strange quirks and gotchas. 
(Dangerous usage cases like above)
  - Greater library/language decoupling.
  - The possibility of implementing an UnsafeMap type with high 
performance but less safety for the few cases where the performance 
really matters.

In my opinion, D's lack of protection (const arrays, etc) and robustness 
against programming mistakes is D's greatest weakness.

My contribution to the slogan thread:

"D - Programming without a lifeline" :)

/Oskar