Iterator straw-man

Tue Nov 7 09:31:14 PST 2006

C++ supports pointers as iterators largely for compatibility with 
existing/old code and because of the prevalence of pointer use in the 
language.  D however, is not burdened with the need for backwards 
compatibility, nor is pointer use at all common.  Languages without 
pointers typically implement iterators as a single object that is itself 
aware of the end of the referenced sequence.  But while this works quite 
well for sequential operations, it falls apart somewhat for random 
access operations, particularly in languages lacking operator overloading.

Unlike C++, D contains a fairly robust built-in dynamic array type. 
These arrays may be passed by value efficiently and contain both 
built-in property methods and user extension of property semantics using 
the following convention:

     void mutate( char[] c, int i );
     char[] c;
     c.mutate( 0 );

D also supports a slice syntax which is roughly comparable to a 
self-contained random access iterator.  Thus "abcdef"[0 .. 2] returns a 
dynamic array type supporting all the typical array operations, 
including the length property for determining an end point.  Slicing is 
supported for user-defined types as well, so this method may be applied 
to any container supporting random access.

Given the presence of D's slice syntax, it would be a shame to ignore it 
in favor of pointer-style random access iterators for algorithm use. 
And while the idea of a single iterator design for all access methods 
(sequential and random access) seems vaguely appealing, it relies 
heavily on traits templates to allow proper function overloading for 
different iterator categories.  In fact, overloading template functions 
for different iterator categories is so annoying that the C++ committee 
is adding concept checking to C++ 0x to simplify the implementation of 
such overloaded template functions.

I believe D's slice syntax offers a perfect opportunity to provide a 
reasonable, intuitive syntax for robust "smart" random-access iterators. 
  The built-in array type will work as-is, and a user-defined random 
access container could support the same syntax by providing opIndex, 
opIndexAssign, opSlice, and opSliceAssign for both the container and 
iterator types.  Obtaining an iterator on a random access container, 
then, would be done via a slice operation:

     auto i = myCont[0 .. $];
     foreach( e; i ) {} // sequential access
     auto e = i[0];     // random access

The remaining iterator categories in C++ are essentially variations of 
forward iterators and bidirectional iterators, and could roughly be 
described as requiring a means of advancing, testing whether the 
iterator references a valid element, and of accessing the referenced 
element.  Testing validity in C++ is done by comparing the data iterator 
with an "end" iterator for equality, as again, this syntax works with 
pointers as well.  But using two separate iterators for defining the 
bounds of a range:

     * Is cumbersome.  It means passing two parameters to every function 
requiring an iterator instead of one.

     * Is dangerous.  Comparing the "end" iterator of one range with the 
data iterator of another range is legal if the data (and possibly 
container) types match, and can result in an attempt to read/write past 
the end of the referenced sequence.

     * Is unnecessary.  It is more natural, and more common, to perform 
random access operations via array index operations, which leaves 
typical iterator use largely restricted to sequential access.  This 
being the case, why support pointers as valid iterators?

What follows is a rough description of a "smart" read-only iterator for D:

interface Iterator<T>
{
     bool atEnd();
     T next();
     T curr();
     int opApply( int delegate( T ) dg );
}

The nuances are unimportant, but the basic idea is that any such 
iterator should provide a means of advancing, testing whether the 
iterator references a valid element, and of referencing that element. 
Also, opApply is supported to allow iterators to be used in foreach 
operations for simple uses.

Assuming such a design, the iterator type determines progress through 
the container (forward, reverse, stepped, etc) rather than the algorithm 
being applied.  Here is a straightforward implementation of some 
algorithms using such iterators, 'C' is a container type for demonstration:

     bool contains( C cont, E match ) {
         foreach( e; cont ) {
             if( e == match )
                 return true;
         }
         return false;
     }

     I find_first( C cont, E match ) {
         auto i = cont.fwdIt;
         foreach( e; i ) {
             if( e == match )
                 return i;
         }
         return i;
     }

     I find_last( C cont, E match ) {
         auto i = cont.revIt;
         foreach( e; i ) {
             if( e == match )
                 return i;
         }
         return i;
     }

Given the above, it seems awkward that the iterator must be declared 
outside of foreach so a copy may be made to return the position of the 
found element.  Therefore, foreach should allow an arbitrary key type to 
be returned by opApply.  For array operations, the key could continue to 
be an integer representing the array postion., but in sequence 
operations the key could be an iterator representing the current 
position.  Thus, the following operations would be equivalent:

1)  auto i = cont.fwdIt;
     foreach( e; i ) {
         if( e == match )
             return i;
     }

2)  foreach( i, e; cont.fwdIt ) {
         if( e == match )
             return i;
     }

3)  foreach( i, e; cont ) {
         if( e == match )
             return i;
     }

This being the case, I believe iterators must be copyable in a generic 
manner, and ideally, copies must be relatively efficient to perform. 
Therefore, I suggest the addition of a .dup() property to the standard 
iterator interface to flatten the differences between class and 
struct-based iterators, and it may be advisable to use structs as 
iterators in the standard case and forget about classes and interfaces 
altogether.  That said, if a reference to the current iterator could be 
passed around within foreach instead of a copy being made for each 
iteration, then much of the cost could be eliminated.  This should 
easily be possible using the same auto-dereferencing semantics used by 
"inout" parameters in D already.

So in summation, D adequately represents "smart" random access iterators 
via array slice semantics and sequential iteration is best represented 
using a single "smart" iterator object as opposed to a pair of "dumb" 
iterators as per C++.  Finally, the example code above suggests that the 
need for a foreach_reverse statement may not be necessary if fwdIt and 
revIt properties are provided for built-in arrays.  Though the only 
functional difference between:

     foreach( i, e; array )

and

     foreach( i, e; array.fwdIt )

would be the "key" type returned in 'i'.  In the first case it would be 
an integer index and in the second it would be an iterator.

As an aside, foreach_reverse may remain useful in instances where we 
want the index of the current element instead of an iterator.  Or array 
iterators may provide a means to obtain the relevant index.  Perhaps 
someone could make an argument for or against based on these ideas.