Iterator straw-man
Sean Kelly
sean at f4.ca
Tue Nov 7 09:31:14 PST 2006
C++ supports pointers as iterators largely for compatibility with
existing/old code and because of the prevalence of pointer use in the
language. D however, is not burdened with the need for backwards
compatibility, nor is pointer use at all common. Languages without
pointers typically implement iterators as a single object that is itself
aware of the end of the referenced sequence. But while this works quite
well for sequential operations, it falls apart somewhat for random
access operations, particularly in languages lacking operator overloading.
Unlike C++, D contains a fairly robust built-in dynamic array type.
These arrays may be passed by value efficiently and contain both
built-in property methods and user extension of property semantics using
the following convention:
void mutate( char[] c, int i );
char[] c;
c.mutate( 0 );
D also supports a slice syntax which is roughly comparable to a
self-contained random access iterator. Thus "abcdef"[0 .. 2] returns a
dynamic array type supporting all the typical array operations,
including the length property for determining an end point. Slicing is
supported for user-defined types as well, so this method may be applied
to any container supporting random access.
Given the presence of D's slice syntax, it would be a shame to ignore it
in favor of pointer-style random access iterators for algorithm use.
And while the idea of a single iterator design for all access methods
(sequential and random access) seems vaguely appealing, it relies
heavily on traits templates to allow proper function overloading for
different iterator categories. In fact, overloading template functions
for different iterator categories is so annoying that the C++ committee
is adding concept checking to C++ 0x to simplify the implementation of
such overloaded template functions.
I believe D's slice syntax offers a perfect opportunity to provide a
reasonable, intuitive syntax for robust "smart" random-access iterators.
The built-in array type will work as-is, and a user-defined random
access container could support the same syntax by providing opIndex,
opIndexAssign, opSlice, and opSliceAssign for both the container and
iterator types. Obtaining an iterator on a random access container,
then, would be done via a slice operation:
auto i = myCont[0 .. $];
foreach( e; i ) {} // sequential access
auto e = i[0]; // random access
The remaining iterator categories in C++ are essentially variations of
forward iterators and bidirectional iterators, and could roughly be
described as requiring a means of advancing, testing whether the
iterator references a valid element, and of accessing the referenced
element. Testing validity in C++ is done by comparing the data iterator
with an "end" iterator for equality, as again, this syntax works with
pointers as well. But using two separate iterators for defining the
bounds of a range:
* Is cumbersome. It means passing two parameters to every function
requiring an iterator instead of one.
* Is dangerous. Comparing the "end" iterator of one range with the
data iterator of another range is legal if the data (and possibly
container) types match, and can result in an attempt to read/write past
the end of the referenced sequence.
* Is unnecessary. It is more natural, and more common, to perform
random access operations via array index operations, which leaves
typical iterator use largely restricted to sequential access. This
being the case, why support pointers as valid iterators?
What follows is a rough description of a "smart" read-only iterator for D:
interface Iterator<T>
{
bool atEnd();
T next();
T curr();
int opApply( int delegate( T ) dg );
}
The nuances are unimportant, but the basic idea is that any such
iterator should provide a means of advancing, testing whether the
iterator references a valid element, and of referencing that element.
Also, opApply is supported to allow iterators to be used in foreach
operations for simple uses.
Assuming such a design, the iterator type determines progress through
the container (forward, reverse, stepped, etc) rather than the algorithm
being applied. Here is a straightforward implementation of some
algorithms using such iterators, 'C' is a container type for demonstration:
bool contains( C cont, E match ) {
foreach( e; cont ) {
if( e == match )
return true;
}
return false;
}
I find_first( C cont, E match ) {
auto i = cont.fwdIt;
foreach( e; i ) {
if( e == match )
return i;
}
return i;
}
I find_last( C cont, E match ) {
auto i = cont.revIt;
foreach( e; i ) {
if( e == match )
return i;
}
return i;
}
Given the above, it seems awkward that the iterator must be declared
outside of foreach so a copy may be made to return the position of the
found element. Therefore, foreach should allow an arbitrary key type to
be returned by opApply. For array operations, the key could continue to
be an integer representing the array postion., but in sequence
operations the key could be an iterator representing the current
position. Thus, the following operations would be equivalent:
1) auto i = cont.fwdIt;
foreach( e; i ) {
if( e == match )
return i;
}
2) foreach( i, e; cont.fwdIt ) {
if( e == match )
return i;
}
3) foreach( i, e; cont ) {
if( e == match )
return i;
}
This being the case, I believe iterators must be copyable in a generic
manner, and ideally, copies must be relatively efficient to perform.
Therefore, I suggest the addition of a .dup() property to the standard
iterator interface to flatten the differences between class and
struct-based iterators, and it may be advisable to use structs as
iterators in the standard case and forget about classes and interfaces
altogether. That said, if a reference to the current iterator could be
passed around within foreach instead of a copy being made for each
iteration, then much of the cost could be eliminated. This should
easily be possible using the same auto-dereferencing semantics used by
"inout" parameters in D already.
So in summation, D adequately represents "smart" random access iterators
via array slice semantics and sequential iteration is best represented
using a single "smart" iterator object as opposed to a pair of "dumb"
iterators as per C++. Finally, the example code above suggests that the
need for a foreach_reverse statement may not be necessary if fwdIt and
revIt properties are provided for built-in arrays. Though the only
functional difference between:
foreach( i, e; array )
and
foreach( i, e; array.fwdIt )
would be the "key" type returned in 'i'. In the first case it would be
an integer index and in the second it would be an iterator.
As an aside, foreach_reverse may remain useful in instances where we
want the index of the current element instead of an iterator. Or array
iterators may provide a means to obtain the relevant index. Perhaps
someone could make an argument for or against based on these ideas.
More information about the Digitalmars-d
mailing list