User defined type and foreach

Fri Jan 19 18:13:55 UTC 2024

On Friday, January 19, 2024 3:49:29 AM MST Jim Balter via Digitalmars-d-learn 
wrote:
> On Friday, 17 November 2017 at 17:55:30 UTC, Jonathan M Davis
>
> wrote:
> > When you have
> >
> > foreach(e; range)
> >
> > it gets lowered to something like
> >
> > for(auto r = range; !r.empty; r.popFront())
> > {
> >
> >     auto e = r.front;
> >
> > }
> >
> > So, the range is copied when you use it in a foreach.
>
> Indeed, and the language spec says so, but this is quite wrong as
> it violates the specification and design of ranges ... only
> forward ranges are copyable and only via their `save` function. I
> have an input range that can only be iterated once; if you try to
> do so again it's empty ... but the foreach implementation breaks
> that. You should be able to break out of the foreach statement,
> then run it again (or another foreach) and it should continue
> from where it left off, but copying breaks that. I need to know
> how many elements of my range were consumed; copying breaks that.
> I got around this by having a pointer to my state so only the
> pointer gets copied. I would also note that tutorials such as Ali
> Çehreli's "Programming in D – Tutorial and Reference" are unaware
> of this breakage:
>
> "
> Those three member functions must be named as empty, popFront,
> and front, respectively. The code that is generated by the
> compiler calls those functions:
>
>      for ( ; !myObject.empty(); myObject.popFront()) {
>
>          auto element = myObject.front();
>
>          // ... expressions ...
>      }
> "

No spec is being violated, and the behavior is completely expected. The core
problem is that the range API doesn't actually specify the semantics of
copying a range - and actually can't do so without making breaking changes.

D types in general fall into one of three categories with regards to their
copy semantics:

1. value types
2. reference types
3. pseudo-reference types

When you copy a value type, you get two fully independent copies of the
object (e.g. integers are a prime example of this; mutating a copy of an
integer has no effect whatsoever on the original).

When you copy a reference type, you get two fully dependent copies. The type
is either a pointer or a reference (or a struct that holds just a pointer or
a reference), and copying it results in another pointer or reference to the
same object. So, mutating the object affects all pointers or references to
that object.

When you copy a pseudo-reference type, you get a partial copy. Typically,
you're dealing with a struct which has both members which are value types
and members which are reference types. The result is that some operations
will affect only the object being mutated, whereas other operations will
affect other copies of that object. Dynamic arrays are one example of this.
They container a pointer and a size_t which is the length of the array, and
reducing the size of the array by mutating the pointer and the length has no
effect on other dynamic arrays which point to the same data, but mutating
the actual elements affects all dynamic arrays which point to the same data.

What this means for ranges is that depending on how they're implemented, you
get one of four different behaviors when they're copied:

1. If the range is a value type, then copying the range results in two
independent copies, so mutating the copy has no effect on the original. Code
can iterate through both ranges independently.

2. If the range is a reference type, then copying the range results in two
dependent copies, so mutating the copy mutates the original. Any code that
iterates one of the two ranges then affects any code which would try to
iterate the original, but the state is consistent across both ranges,
because rather than containing their data, the data is elsewhere, and they
both point to the same place.

3. If a range is a pseudo-reference type, and its iteration state is copied
by value, then copying the range gives you the same behavior as a value type
as far as iteration goes. Both the copy and the original can be iterated
independently (though depending on the implementation, mutating the elements
themselves could affect both ranges). Dynamic arrays fall in this category.

4. If a range is a pseudo-reference type, and its iteration state is not
fully copied by value, then you end up with the copy and the original being
partially dependent. This means that if you mutate one of them, it will only
partially mutate the other, which tends to mean that the other ends up in an
invalid state. A common situation where this can occur is if the range
stores its front as a member variable, but the rest of its state is stored
in another member variable which is a reference type. If you then call
popFront on the copy, you'd end up with the copy's front changing, but the
original's front wouldn't, and yet, the state they share for the rest of the
range would be mutated, so if you then called popFront on the original, you
wouldn't get the element that the copy got by calling popFront; you'd get
the element after it - meaning that calling popFront on one range would then
cause the other to skip an element. So, with pseudo-reference types that do
not copy their iteration state by value, you basically can never use the
original once you make a copy, because the original will be put into an
invalid / inconsistent state by mutating the copy.

The result of all of this is that you cannot rely on the semantics of
copying a range. When you make a copy, you can rely on that copy having the
same elements in the same order that the original would have had if you
hadn't made a copy (since otherwise, you couldn't pass ranges around), but
you can't rely on what happens to the original when you mutate the copy.
This means that in generic code, once you copy a range, you cannot safely do
anything with the original (since you'll get completely different behavior
from different range types).

In order for forward ranges to be able to get independent copies regardless
of their copy sementics, we have the save function. This means that you can
rely on save giving you an independent copy of the forward range, but you
still can't rely on the semantics of copying the range, and there is no way
to get independent copies of basic input ranges (and you can't rely on
copies of basic input ranges being fully dependent either, since they could
be pseudo-reference types).

This means that in generic code, once you pass a range to foreach, you
cannot use the original any longer, because you've copied it, and the
semantics of how the original are affected by iterating through the copy
depend on how that particular range type is implemented. Non-generic code
can get away with it by relying on the copy semantics of a particular range
type, but generic code can't.

What this means in practice is that if you want foreach to use an
independent copy of the range, you need to use save, and if you want to only
partially iterate a range with a loop and then use the original afterwards,
you cannot use foreach, because it copies the range, which then potentially
puts the original range in an invalid state. And because you can't use save
on a basic input range, once you pass a basic input range to foreach, you
cannot safely use the original any longer. You either fully iterate through
the range or you stop partway through and then never do anything with the
rest of the range (since there is no safe way to do so in generic code), but
there is no way to continue to iterate through the range once you pass it to
foreach and then exit the foreach without relying on the copy semantics of a
specific range type. So, for generic code, you'll need to use something
other than foreach and avoid copying the range.

To fix this, we would need to make it so that the range API defined the copy
semantics of ranges, but that doesn't work as long as forward ranges are
treated as a more advanced version of basic input ranges, because basic
input ranges and forward ranges inherently have different copy semantics.

Basic input ranges cannot be value types (otherwise, they could implement
save), which means that in order to have consistent copy semantics for them,
we would have to require that they be reference types (e.g. by requiring
that they be classes or pointers to structs). On the other hand, we can't
make forward ranges require reference semantics, because that won't work
with dynamic arrays (and really what we want for forward ranges is to have
the same copying semantics as dynamic arrays - i.e. their iteration state is
copied by value). We could make forward ranges have consistent copy
semantics by getting rid of save and requiring that copying a range results
in a range that can be independently iterated (which would mean requiring
that they be dynamic arrays or structs, forcing any reference type ranges to
be wrapped in structs which do the equivalent of calling save internally
when necessary to get an independent copy). And that way, we would have
consistent copy semantics for basic input ranges and consistent copy
semantics for forward ranges - but they wouldn't be the same copy semantics,
so for generic code to be able to rely on the copy semantics, it would have
to only support basic input ranges or only support forward ranges and not
basic input ranges.

This means that in order to fix the problem, we need to change the range
API, which would break code. Hopefully, we'll be able to do that with the
next version of Phobos which is currently in the planning stages, but that
will then only work with new code or code that is updated to use the new
API. Code written to work with the current range API will always have the
problem.

- Jonathan M Davis