Range Redesign: Copy Semantics

Sun Jan 21 05:00:31 UTC 2024

I've been thinking about this for a while now, but with the next version of
Phobos which is in the early planning stages, we really should do some
redesigning of ranges. Most of their API is just fine how it is, but there
are some aspects of it which really should be changed if we want them to be
better (the most obvious one being the removal of auto-decoding). But what
I'd like to discuss specifically in this thread is fixing - and defining -
the semantics of copying and assigning to ranges. Specifically, the
semantics of stuff like

auto copy = orig;

and

range1 = range2;

should be defined (which then defines what happens when passing ranges to
functions and stuff like foreach).

Right now, the only semantics of copying a range which are defined by the
range API are what happens to the copy when you make a copy. Specifically,
the copy will have the exact same elements in the exact same order that the
original would have had if it had been iterated through prior to the copy
being made. If that weren't true, then we couldn't pass ranges around.
However, because ranges support a whole range of types with varying copy
semantics, you can't rely on a range's copy semantics beyond that
(particularly with regards to the state of the original range after the copy
has been made). Specifically, a range type can have four different types of
copy semantics:

1. The range is a value type, so copying it results in two completely
indepedent copies. Iterating through one has no effect on the other.

2. The range is a reference type, so copying it results in two completely
dependent copies. They both share all of their state, so iterating through
one iterates through the other, and they both always have the same state.

3. The range is a pseudo-reference type, and it has value semantics for its
iteration state but reference semantics for some portion of the rest of its
state. This means that if you iterate through a copy of the range, it has no
effect on the original, so the range is a value type in that respect, but
the rest of its state may or may not be copied by value (often - though not
always - this means that the iteration state is directly within the range
type, but the elements themselves are referred to via pointer, so mutating
the actual elements would affect all copies of that range, but each range
can be iterated independently). Dynamic arrays are the most common example
of this sort of range.

4. The range is a pseudo-reference type, but it does not have value
semantics with regards to its iteration state. So, when copying the range,
some portion of its state is copied by value, but you don't end up with
indendepent copies - and in fact, you don't get dependent copies either,
since some of the state is copied by value. A common example of this sort of
range which would be one which contains most of its state via a pointer or
reference but which caches its front as a member variable within the struct.
The result of this is that once you've mutated a copy of the range, you can
no longer use the original, because mutating the copy only partially mutates
the state of the original, leaving it in an inconsistent / invalid state.

Generic code must support ranges with any of these copy semantics, and since
these copy semantics are drastically different from one another, the result
is that in fully generic code, once you copy a range, you cannot ever use
the original again (since you can't depend on its state). And you can't even
assign it a new value and then use it, because the semantics of assignment
aren't defined for ranges either (though they're likely to be strongly tied
to the copy semantics). Once you copy a range, you have to assume that the
orginal range is in a potentially invalid state and can't do anything
further with it.

In addition to copying ranges to new variables and passing them to other
functions, this also means that you can't actually continue to use a range
once it's been passed to foreach, because the lowering that foreach uses
copies the range (and it has to or it would break the semantics of using
dynamic arrays with foreach). So, you can't do stuff like partially iterate
through a range, break out of the loop, and then continue to iterate through
it, because the state of the range is implementation-dependent and could be
invalid (on top of the fact that in the case of many forward ranges, copying
it is equivalent to save, so trying to continue the iteration on the
original won't work, whereas it would work with some basic input ranges, so
even without any ranges ending up in an invalid state, the semantics depend
on the type in a way that does not work with generic code). In fact, someone
necro-ed a thread in D.Learn on this issue with foreach just the other day,
since it's not obvious and can be pretty surprising.

Of course, in order to try to solve the problem with varying copy semantics,
the range API has save for forward ranges. This makes it possible to get an
independent copy of a forward range, and any generic code which needs an
independent copy needs to use it to get one. However, that does not help at
all with the inconsistent semantics of actually copying a range, and it's
very common for people to write code that works with dynamic arrays but then
doesn't work with other types of ranges, because other types of ranges have
differing copy semantics, and they didn't test their code with enough range
types to find where they accidentally relied on the copy semantics (or
assignment semantics) of dynamic arrays (or on the semantics of whatever
range type it was that they tested with).

If we really want to fix these classes of issues, we need ranges to have
consistent copy semantics, which means that we need to define what those
semantics are.

For forward ranges, this is quite straightfoward. We need to require that
copying a forward range results in two copies which can be independently
iterated - so essentially, the iteration state must be copied by value, or
another way to look at it is that copying a forward range must be equivalent
to calling save (though the situation is actually more nuanced than that).
We would be requiring that all forward ranges have the same copying
semantics as dynamic arrays with regards to their iteration state (but would
allow the elements themselves to be copied by value or by reference like we
do now). And of course, we would then get rid of save, since it would be
unnecessary.

What this would mean is that all forward ranges would be either dynamic
arrays or structs. Any code that then wanted a class or a pointer to a
struct to be a forward range would need to wrap it in a struct to give it
the correct copy semantics. A naive implementation would just duplicate the
class whenever the struct was copied, but a smarter implementation could do
something like use reference counting to keep track of how many ranges there
were pointing to that same class object and then duplicate it whenever a
function was called on it which would mutate its iteration state, and the
reference count was greater than 1 (and of course not bother to duplicate it
if the reference count was 1). It wouldn't need to do more memory management
than that (though it could), so that doesn't necessarily require anything
@system or @trusted as often gets discussed with reference counting. Of
course, allowing such an implementation does technically mean that copying a
forward range is not necessarily strictly equivalent to calling save, but
the overall iteration semantics would treat each copy as independent like
save is supposed to allow, so it wouldn't be a problem.

And with that change, we then have consistent copy semantics with forward
ranges. All code operating on forward ranges could treat them the same as
they do dynamic arrays when copying them and passing them around, and you
wouldn't have to worry about forgetting to call save. I think that we can
probably get general agreement on such a change, since it's been discussed
off-and-on for years now. However, the problem then is basic input ranges.

Basic input ranges cannot have the same copy semantics as dynamic arrays. If
they could, then they could be forward ranges. As things stand, they either
have reference semantics, or they have pseudo-reference semantics of the
variety where mutating a copy puts the original in an invalid state. The
only way that we could make their copy semantics consistent is if we then
require that they be reference types. The simplest way to do that would be
to require that they be classes or pointers to structs, though that would
then make it impossible to do anything like reference counting with basic
input ranges, which isn't ideal, but it would have the advantage of allowing
us to force reference semantics at compile time.

This would of course mean that basic input ranges and forward ranges would
then be defined to have different copy semantics, which would mean that
either generic code that relied on the copy semantics would have to work
with only one of them, or we would need to give basic input ranges a
different API. And if we allowed basic input ranges to be structs that
defined reference semantics internally (e.g. by implementing reference
counting on a pointer or class reference), then we would have to give basic
input ranges a different API, because we wouldn't have save to differentiate
them anymore, and if we allowed structs, then we couldn't differentiate them
based on a basic input range being a class reference or pointer to a struct,
whereas a forward range would be a struct or a dynamic array.

Ultimately, I'm inclined to argue that we should give basic input ranges a
new API - not just because it would allow them to use reference counting,
but also because the current input range API tends to force basic input
ranges to cache the value of front (which if anything, encourages basic
input ranges to be pseudo-reference types and could result in an extra layer
of indirection in some cases if they're forced to be reference types). It
would be annoying in some cases to require that different functions be used
for basic input ranges and forward ranges (though overloading could
obviously be used to avoid needing different names), but it's already often
the case that code isn't really designed to work with both, and overloading
on the category of range being used is already fairly common, since
different range capabilities allow for different implementations. So, given
that it would prevent whole classes of copying bugs as well as potentially
remove the requirement to cache front for basic input ranges, I think that a
separate API for basic input ranges is warranted. What I would propose for
that would be a single function

    auto next();

where next returns a nullable type where the value returned is the next
element in the range, with a null value being returned if the range is
empty. The return type would then need to emulate a pointer - specifically,
when casting it to bool, it would be true if it's non-null and false if it's
null, and dereferencing it would give you the actual value if it's non-null
(with it being undefined behavior if you dereference null). So, a basic
input range of ints might define next as

    int* next();

or alternatively, it could be something like

    Nullable!int next();

though that wouldn't work with Phobos' current Nullable type, since it
doesn't support either casting to bool or dereferencing (probably because it
originally used alias this). Either way, since we'd just require the
specific API for the return type rather than requirng a pointer, the range
type would have some flexibliity in what it used.

This would then mean that if you wanted to loop over a basic input range,
you'd do something like

    for(auto front = range.next; !front; front = range.next)
    {
        ...
    }

And if we go down this road, then we could also add this API to foreach,
allowing for code such as

    foreach(e; basicInputRange)
    {
        ...
    }

to work - and unlike now, you could rely on the copy semantics involved such
that you would know that you could then break out of the loop and continue
to use the range (whereas right now, you can't safely break out of a foreach
and then continue to use the range that you were iterating over). Of course,
for forward ranges, breaking out of a foreach still wouldn't let you
continue to iterate from where you were, since the foreach would be
iterating over a copy, but those would be the semantics for all forward
ranges, so you could rely on them.

In addition to these changes to the copy semantics, I propose that we then
require that assignment follow the same semantics as copying. So,

    range1 = range2;

on a forward range would result in two independent copies with the same the
same elements in the same order (regardless of what range1 pointed to
before), just like you'd get with

    auto copy = orig;

whereas for basic input ranges

    range1 = range2;

would result in range1 and range2 being fully dependent copies (regardless
of what range1 pointed to before) - just like what you'd expect when
assigning one pointer to another.

So, the assignment semantics would differ between basic input ranges and
forward ranges, but they would be consistent with their copy semantics, and
they would be consistent across all ranges of the same category (and of
course, assignment would only be expected to work if the two ranges were of
exactly the same type).

The result of all of this is that we could then have consistent, reliable
copy and assignment semantics for both basic input ranges and forward
ranges, and the two would be easily distinguishable so that you don't
accidentally use one as the other and end up with bugs due to the
differences in semantics between the two categories of ranges.

The other potential change related to this would be to rename basic input
ranges. Right now, forward ranges are treated as an extension of input
ranges, which makes sense if you're ignoring the issues surrounding copy
semantics, since a forward range is an input range with an extra function
that allows you to get an independent copy, but if we're giving them
different APIs so that the copy semantics can be cleanly defined and
separated, then arguably, it doesn't make sense to call them both ranges. As
such, we could switch to calling forward ranges simply ranges (which we
mostly do anyway) and give basic input ranges a new name. My initial
suggestion would be an input stream (or if we absolutely had to continue to
call them ranges, then maybe simple ranges), but there's probably a better
name to be had, and obviously, plenty of bikeshedding can be done on that
topic. Either way, the names used here aren't the primary concern.

What I'd like to get out of this thread is feedback on how much sense this
idea does or doesn't make and what problems I'm missing that someone else is
able to see. From what I can see, the main negative is simply that you can't
then write code that works on both a basic input range and a forward range
(though you can obviously still create function overloads so that the caller
can use either), but given the issues surrounding copy semantics, I think
that that's probably ultimately a good change (and the number of functions
that can operate on basic input ranges is already pretty limited anyway in
comparison to forward ranges - particularly with regards to generic
algorithms). It will also make it much easier to discuss the separation
between basic input ranges and forward ranges, which IMHO can be too easy to
lose or confuse as things stand.

- Jonathan M Davis