Why must a bidirectional range also be a forward range?

Sat Sep 21 08:04:18 UTC 2019

On Friday, September 20, 2019 7:08:03 AM MDT Joseph Rushton Wakeling via 
Digitalmars-d-learn wrote:
> On Thursday, 19 September 2019 at 22:55:55 UTC, Jonathan M Davis
>
> wrote:
> > For better or worse, ranges were more or less set up as a
> > linear hierarchy, and it's unlikely that use cases for
> > bidirectional ranges which aren't forward ranges are common. I
> > expect that it's a bit like infinite, bidirectional ranges. In
> > theory, they could be a thing, but the use cases for them are
> > uncommon enough that we don't really support them. Also, I
> > expect that most range-based algorithms which operate on
> > bidirectional ranges would require save anyway. A lot of
> > algorithms do to the point that basic input ranges can be
> > incredibly frustrating to deal with.
> >
> > [ ... ]
>
> Thanks for the characteristically thorough description of both
> the design considerations and the history involved.
>
> On reflection it occurs to me that the problem in my thinking may
> be the idea that `save` should result in a full deep copy.  If
> instead we go by how `save` is implemented for dynamic arrays,
> it's only ever a shallow copy: it's not possible to make valid
> assumptions of reproducible behaviour if the original copy is
> modified in any way.
>
> If instead we assume that `save` is only suitable for temporary
> shallow-copies that are made under the hood of algorithms, then
> my problems go away.

save is supposed to result copies that can be independently iterated over.
So, code such as

foreach(r; range.save)
{}
auto arr = r.save.array();
assert(equal(arr, r));

should work. How that's implemented underneath the hood doesn't really
matter. However, none of that really takes into account mutation of the
elements. The range API pretty much assumes that you don't ever modify any
of the elements in the range as you iterate over them. So, if you do
something like

auto orig = range.save;
range.front = 42;

whether orig.front is then 42 is implementation-dependent. So, if you're
modifying elements as you go, then the behavior you get is going to be
highly dependent on what you're doing with the ranges, and certainly, if
you're only using a range within a very specific context, it can be
implemented in a way that works in that context but doesn't work with
range-based functions in general. You just run the risk of problems if you
then later modify the code to use other range-based functions which don't
necessarily work with whatever you've done with the range.

As for temporary, shallow copies, IIRC, isForwardRange requires that save
returns exactly the same type as the original. So, while you can certainly
have a range referring to a data structure without owning any of the data
(after all, that's what happens with dynamic arrays), you can't have a range
of one type that owns the data and then have save return a range type which
just refers to the data unless the range is written in a way that you can
have both within the same type.

One example of avoiding the need to deep-copy with save where one range is
at least sort of the owner whereas the others aren't is how dxml's stax
parser works. The ranges share a context that keeps track of how far into
the range the farthest range is, and popFront only does the validation when
the range that popFront is being called on is the farthest of any range.
That way, the stuff related to validating end tags didn't need to be
deep-copied, but save always returns exactly the same type, and you get
exactly the same behavior regardless of which range gets iterated farther
first (or even if one is iterated farther and then another is iterated
beyond it).

If popFront every throws, then that range becomes invalid, but the others
are fine. The validation other than that for matching end tags currently all
gets done every time, so all ranges would throw in the same place for errors
other than end tags that don't match, but the same is also true for when the
end tags don't match, because even though that validation is only done for
the farthest range, if it fails, the shared context is left in exactly the
same state, and any other ranges that reach that point would then throw like
the first range did.

Without realizing that the validation for the end tags didn't have to be
done for every instance of the range but only the one which was farthest
along, I would have been screwed with regards to save, because deep-copying
would have been required. I'm not sure that that particular trick is widely
applicable, but it is an example of how save can do something other than a
deep copy even though having each range do exactly the same work would have
required a deep copy.

> > Assuming we were redesigning the range API (which may happen if
> > we do indeed end up doing a Phobos v2), then maybe we could
> > make it so that bidirectional ranges don't have to be forward
> > ranges, but honestly _any_ ranges which aren't forward ranges
> > are a bit of a problem. We do need to support them on some
> > level for exactly the kind of reasons that you're looking to
> > avoid save with a bidirectional range, but the semantic
> > differences between what makes sense for a basic input range
> > and a forward range really aren't the same (in particular, it
> > works far better for basic input ranges to be reference types,
> > whereas it works better for forward ranges to be value types).
>
> It occurs to me that the distinction we're missing here might
> between "true" input ranges (i.e. which really come from IO of
> some kind), which indeed must be reference types, versus "pure"
> input ranges (which are deterministic, but which don't
> necessarily allow algorithms to rely on the ability to save and
> replay them).

Well, in general, the difference between a basic input range and a forward
range is whether you can cheaply copy their state to then have an
independent copy to iterate over. If you can't cheaply copy their state,
then they're fundamentally going to be either reference types or
pseudo-reference types. The closest I can think to having a basic input
range that was a value type would be something like a bufferless socket
which just retained the socket ID (so, it would be just contain an integral
value), but since that ID is pointing to the data elsewhere, it's still
basically a reference type. I don't see how you could have a basic input
range which couldn't be a forward range which was truly a value type.

You could have a situation where you have a range which was a value type but
too expensive to copy for it to be a forward range, but at that point, you
just wouldn't make it a range, since ranges get copied all over the place,
and it really doesn't work for them to be too expensive to copy. You'd
pretty much have to create a separate type which referred to it which was
the range.

You can certainly have types which are reference types or pseudo-reference
types which are forward ranges, so they don't have to be value types, but
they do have to have a way to create independent copies to iterate over the
same data, or they can't be forward ranges. So, forward ranges don't _have_
to be value types in that sense, but the way they get passed around pretty
much assumes that they're value types. Unless a range-based function takes
its argument by ref, you pretty much have to assume that range is going to
be consumed and call save when passing it if want to continue to use the
range. So, really, we'd just be better off if we required that all forward
ranges be value types and that copying them was equivalent to save (so, copy
constructors would have to be used in lieu of save if the default copy
wasn't equivalent to save, and classes would have to be wrapped by structs
with copy constructors). That eliminates all of the bugs that we get which
relate to save not being called when it should be, because the code assumes
that copying is save.

So, I think that what should be done with forward ranges is pretty clear,
and as I understand it, Andrei agrees. What's harder is basic input ranges.
If we can assume that forward ranges are value types, then that makes it so
that the copying semantics of all forward ranges is well-defined (unlike
now), but basic input ranges _can't_ have the same copy semantics, or they
could be forward ranges. They either are reference types (so, mutating the
copy mutates the original), or they're pseudo-reference types (so, mutating
the copy only partially affects the original). The result is that you still
can't assume the semantics for copying basic input ranges. You also can't
assume the semantics of copying a range when the code operates on both basic
input ranges and forward ranges.

If basic input ranges had to be reference types, then we _could_ assume the
semantics for copying them - but only if the code for operating on basic
input ranges is not shared with the code for operating on forward ranges,
and with the current hierarchy, it's a forward range is an input range, so
their code is pretty much always shared so long as the code can operate on
basic input ranges.

That problem there is the main one that stumps me right no with what we
should do with basic input ranges if we were to do a range API redesign. We
can make the semantics for copying (and presumably assignment as well) be
well-defined for forward ranges by requiring that copying be equivalent to
save (and getting rid of save), but that still leaves the semantics for
copying and assigning basic input ranges as a mess. Requiring that basic
input ranges be reference types improves the situation, but it doesn't
completely fix it, and it would require that some basic input ranges be
allocated on the heap when that wouldn't currently be necessary.

In some respects, it does make perfect sense to treat forward ranges as
being basic input ranges with more capabilities, but in other respects,
they're very different beasts to the point that it seems wrong to treat them
as being related.

LOL. And given how hard it is to work with basic input ranges, since you
can't save their state, in some respects, I'm tempted to say that we'd be
better off if they weren't even a thing, but some basic stuff like sockets
and other I/O don't work well at all with forward ranges because of the
buffering that's required. Honestly, even though ranges are basically
streams, they seem to work bizarrely badly for I/O. It seems to me that
ranges are a fantastic tool for the cases where forward ranges work, and
their API works reasonably well for basic iteration when you can't have a
forward range, but going from something that really just works with basic
iteration and going to where that then works with the larger range API
really doesn't work well. It seems like we need a different solution to
bridge that gap, and I really don't know that solution should look like. I
should probably study what Steven did with iopipe and see how that fits into
things.

In any case, that's probably enough rambling on the subject. I do sure wish
that we could find a better way to deal with the gap between basic input
ranges and forward ranges though as well as manage to fix it so that the
semantics of the copying and assignment of ranges in general can be properly
specified and relied upon.

- Jonathan M Davis