RFC on range design for D2

Mon Sep 8 17:24:27 PDT 2008

Sergey Gromov wrote:
> Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org> wrote:
>> Walter, Bartosz and myself have been hard at work trying to find the 
>> right abstraction for iteration. That abstraction would replace the 
>> infamous opApply and would allow for external iteration, thus paving the 
>> way to implementing real generic algorithms.
> 
> opApply() wasn't my hero either. :)  Your article really looks like 
> something I'd expect to find in D.  It only requires foreach support, 
> and yeah, return by reference.

Indeed. Both are in the works.

>> I put together a short document for the range design. I definitely 
>> missed about a million things and have been imprecise about another 
>> million, so feedback would be highly appreciated. See:
>>
>> http://ssli.ee.washington.edu/~aalexand/d/tmp/std_range.html
> 
> - next operation not mentioned in section 3, forward ranges.

Thanks! Fixed.

> - the union operations look... weird.  Unobvious.  I'm too sleepy now to 
> propose anything better but I'll definitely give it a try.  The rest of 
> the interface seems very natural.

I agree I hadn't known what primitives would be needed when I sat down. 
Clearly there was a need for some since individual iterators are not 
available anymore. New ideas would be great; I suggest you validate them 
by implementing some nontrivial algorithms in std.algorithm with your, 
um, computational basis of choice :o).

> - what's a collection?  How do you get a range out of there?  Collection 
> should be a range itself, like an array.  But it shouldn't be destroyed 
> after passing it to foreach().  How to achieve this if foreach() 
> essentially uses getNext()?

These are great questions, I'm glad you asked. The way I see things, D2 
ranges can be of two kinds: owned and unowned. For example D1's ranges 
are all unowned:

auto a = new int[100];
...

This array is unowned because it going out of scope leaves the 
underlying array in place. Now consider:

scope a = new int[100];

In this case the array is owned by the current scope. Scoped data is a 
very crude form of ownership that IMHO brought more trouble than it 
solved. It's a huge hole smack in the middle of everything, and we plan 
to revisit it as soon as we can.

A better design would be to define collections that own their contents. 
For example:

Array!(int) a(100);

This time a does own the underlying array. You'd be able to get ranges 
all over it:

int[] b = a.all;

So now we have two nice notions: Arrays own the data. Ranges walk over 
that data. An array can have many ranges crawling over it. But two 
arrays never overlap. The contents of the array will be destroyed (BUT 
NOT DELETED) when a goes out of scope.

What's the deal with destroyed but not deleted? Consider:

int[] a;
if (condition) {
    Array!(int) b;
    a = b.all;
}
writeln(a);

This is a misuse of the array in that a range crawling on its back has 
survived the array itself. What should happen now? Looking at other 
languages:

1) All Java objects are unowned, meaning the issue does not appear in 
the first place, which is an advantage. The disadvantage is that scarce 
resources must be carefully managed by hand in client code.

2) C++ makes the behavior undefined because it destroys data AND 
recycles memory as soon as the array goes out of scope. Mayhem ensues.

We'd like:

1.5) Allow object ownership but also make the behavior of incorrect code 
well-defined so it can be reasoned about, reproduced, and debugged.

That's why I think an Array going out of scope should invoke destructors 
for its data, and then obliterate contents with ElementType.init. That 
way, an Array!(File) will properly close all files AND put them in a 
"closed" state. At the same time, the memory associated with the array 
will NOT be deallocated, so a range surviving the array will never crash 
unrelated code, but instead will see closed files all over.

In the case of int, there is no destructor so none of that happens. 
Surviving ranges will continue looking at the contents, which is now 
unowned.

So there is a difference in the way data with destructors and data 
without destructors is handled. I don't like that, but this is the most 
reasonably effective design I came up with so far.

About the "collection should be a range itself" mantra, I've had a 
micro-epiphany. Since D's slices so nicely model at the same time arrays 
and their ranges, it is very seductive to think of carrying that to 
other collection types. But I got disabused of that notion as soon as I 
wanted to define a less simple data structure. Consider a matrix:

auto a = BlockMatrix!(float, 3)(100, 200, 300);

defines a block contiguous matrix of three dimensions with the 
respective sizes. Now a should be the matrix AND its range at the same 
time. But what's "the range" of a matrix? Oops. As soon as you start to 
think of it, so many darn ranges come to mind.

* flat: all elements in one shot in an arbitrary order

* dimension-wise: iterate over a given dimension

* subspace: iterate over a "slice" of the matrix with fewer dimensions

* diagonal: scan the matrix from one corner to the opposite corner

I guess there are some more. So before long I realized that the most 
gainful design is this:

a) A matrix owns its stuff and is preoccupied with storage internals, 
allocation, and the such.

b) The matrix defines as many range types as it wants.

c) Users use the ranges.

For example:

foreach (ref e; a.flat) e *= 1.1;
foreach (row; a.dim(0)) row[0, 0] = 0;
foreach (col; a.dim(1)) col[1, 1] *= 5;

and so on.

Inevitably naysayers will, well, naysay: D defined a built-in array, but 
it also needs Array, so built-in arrays turned out to be useless. So how 
is that better than C++ which has pointers and vector? Walter has long 
feared such naysaying and opposed addition of user-defined array types 
to Phobos. But now I am fully prepared to un-naysay the naysayers: 
built-in slices are a superior building block to naked pointers. They 
are in fact embodying a powerful concept, that of a range. With ranges 
everything there is can be built efficiently and safely. Finally, 
garbage collection helps by ensuring object ownership while preserving 
well-definedness of incorrect code.

Andrei