Proposed Changes to the Range API for Phobos v3

Sun May 19 05:10:09 UTC 2024

On Saturday, May 18, 2024 2:48:51 PM MDT Dukc via Digitalmars-d wrote:
> > 6. The range API does not currently specify what happens if you copy front
> > and then call popFront. For the vast majority of code, it is both assumed
> > that you can copy front and that the copy of front will be unaffected by
> > calling popFront. However, there _are_ ranges (such as std.stdio's ByLine)
> > which will do things like use a dynamic array as a buffer which has its
> > elements overwritten in popFront, resulting in the elements of the copy
> > being mutated, since it's a slice of the same data. Such ranges are
> > sometimes referred to as having a transient front, and they do not work
> > with a lot of range-based code. As such, they either need to be
> > disallowed, or we need a way to detect such behavior so that range-based
> > code can check for it.
>
> I think this one is beyond the scope of the range spec. Whether the
> buffer of `ByLine` gets copied on element copy is an issue about the
> element of the range, not about the range itself. If this needs fixing,
> the element type of `byLine` needs to be changed, not the range spec.
>
> Otherwise `front` would need to always deep copy the element on each
> invocation, which would make any range of ranges horribly inefficient!

It is something that has proven to be a problem. Many range-based functions
will behave incorrectly if they copy front and then popFront mutates the
copy. The range API needs to lock down the semantics involved for
range-based code to actually be able to rely on them. Any range-based code
which makes assumptions which aren't guaranteed by the range API risks
breaking with types that do not follow those assumptions, and that's
something that has been causing us problems for years, because not enough
was locked down. And the possibility of front being transient is one of
those issues that was missed when the range API was originally designed.

If deep-copying elements is an issue, then pointers or reference types
should be used - or the elements should be non-copyable. We are improving
the support for non-copyable types as part of this (whereas right now, most
range-based code assumes that the elements are copyable), but experience has
shown that allowing popFront to mutate the copy of front causes problems,
and the rare case where something like that is actually desirable has
alternate solutions.

In the case of byLine, it can use non-copyable elements, or it can use
opApply, or it could use some sort of reference-counted type so that it can
detect whether there are currently any copies which would be affected if the
buffer was reused. There is no reason why it has to use its current
solution.

Ultimately though, IMHO, it would be better for such a type to not even be a
range than to make it so that range-based code has to worry about this
issue. Experience has shown that it causes problems for most range-based
code, and it really doesn't add much value.

> > 9. const makes ranges useless. Realistically, a const range is not a
> > range,
> > because you can't call popFront, and you can't iterate through it
> > (technically, it _is_ possible to have popFront be const in cases where
> > the
> > iteration state doesn't actually live in the range, and popFront isn't
> > pure, but normal ranges do not work that way).
>
> In principle it goes like: elements being const is a non-issue, it's
> just another element type. Range itself can be const, but has to be cast
> to non-const with const elements to be actually iterable.

Casting away const and then mutating the result violates D's type system.
The only cases where it doesn't is when the cast actually copies the type.
So, as soon as you have something like a range of const ranges, you're
screwed.

> In practice, no way! With some care, you can write a range that copes
> with const elements, but even that is surprisingly thorny since a struct
> having a const field runs into all sorts of issues.
>
> But just try designing a range that can be cast between const and tail
> const! Bar for some niche ranges like TakeNone, it would be extremely
> hard if not impossible with the current language.

Fixing the issue of tail-const realistically requires a language DIP, and
Walter's DIP should make it so that a const range will be able to be
converted to a range of const elements, whereas right now, that only works
with dynamic arrays, because technically, const(Range!Foo) and
Range!(const Foo) are completely different and unrelated types as far as the
type system is concerned (technically, they could even have completely
different implementations in spite of the fact that they come from the same
template).

> > 10. While it's not exactly a problem with the range API, it would be nice
> > if we didn't have to import std.range / std.range.primitives (or the
> > Phobos v3 equivalent) to use arrays as ranges.
>
> Respectful disagree on this one. Ranges are (bar for the foreach range
> API) a Phobos concept, not a language concept. It's good to be able to
> pick whether functions in your module will digest ranges, and if,
> whether it's V2, V3 or some other API. I wish for no changes here.

Well, Walter has specifically requested that we fix it so that the import is
no longer required, and a lot of us would love to get rid of it. Either way,
this will have no effect on the v2 API, because the functions for the v3 API
will be different. And since they only affect ranges, the only code that it
would potentially cause problems for is code that wants to name one or more
free functions which take dynamic arrays with exactly the same names.

Anyone who wants to create their own range API will still be free to do so.

> > Overview of Proposed Changes
> > ----------------------------
> >
> > 3. If a range is finite (that is, it's not statically known to be empty),
> > and it can be default-initialized, then its init value must be empty. If
> > it
> > cannot be default-initialized, then it must define a function called
> > emptyRange which returns an empty range of the same type. Phobos will then
> > define a function called emptyRange which takes a finite range which can
> > be
> > default-initialized and returns the init value of that range.
>
> I suggest a slight relaxation. A finite range must have an empty init
> value if it does not provide an explicit `emptyRange`. That is, any
> range, finite or infinite, may have any valid `.init` value, as long as
> it's `.emptyRange` (whether explicit or Phobos-provided) is actually empty.

That means that you cannot rely on a default-initialized finite range being
valid and empty, which is part of the whole point here. Having that
requirement will reduce bugs, and the lack of that requirement has caused
bugs, particularly since plenty of folks end up assuming that the init value
is empty even when there is no such guarantee. Right now, there isn't even a
guarantee that it's valid, let alone empty.

At least when a range cannot be default-initialized, you get an error when
you attempt instead of just getting buggy behavior. Requiring emptyRange in
that case then allows us a way to get an empty range in spite of the fact
that we can't rely on the range being default-initializable, but by
requiring that finite ranges that can be default-initialized have an init
value that's both valid and empty will prevent bugs due to folks assuming
that the range is valid and empty like has already been happening. And since
the vast majority of ranges can be default-initialized, emptyRange will only
need to be declared in rare cases, and the free function, emptyRange, will
take care of the cases where the range can be default-initialized.

> > 4. All ranges must be either dynamic arrays or structs (and not pointers
> > to
> > structs either).
>
> Maybe unions too? Not that it's often needed.

Unions can't have functions, so they couldn't directly be ranges anyway. But
even if they could, passing around unions is just begging for trouble. They
don't have postblit or copy constructors even when their members would need
it (since the union doesn't keep track of which of its members is the valid
one), and for code to work properly, it really needs to access whichever the
currently valid member of the union is anyway. Trying to do anything with
unions and ranges would just be begging for bugs.

> > 11. Finite random-access ranges are required to implement opDollar, and
> > their opIndex must work with $. Similarly, any ranges which implement
> > slicing must implement opDollar, and slicing must work with $.
> >
> > In most cases, this will just be an alias to length, but barring a
> > language
> > change that automatically treats length as opDollar (which has been
> > discussed before but has never materialized and is somewhat controversial
> > given types where it wouldn't make sense to treat length as opDollar), we
> > have to require that opDollar be defined, or generic code won't be able to
> > use $ with indexing or slicing. We probably would have required it years
> > ago except that it would have broken code to add the requirement.
>
> Why would any language changes be needed for that? Can't `.length` be a
> Phobos function for types that have `opDollar` but not their own `.length`?

Right now, generic code canot use $ with ranges, because the range API does
not require that it be there, and there are many, many ranges which do not
implement it even if they're random-access and/or have slicing. So, generic
code cannot do something like

    auto slice = range[5 .. $ - 1];

or

    auto value = range[$ - 7];

For that to work, we have to actually require that $ work properly for
ranges with slicing and/or random-access. Right now, you're forced to do
stuff like

    auto slice = range[5 .. range.length - 1];

or

    auto value = range[range.length - 7];

which is far more verbose and does not play well with converting code that
operates on dynamic arrays to code that operates on random-access ranges
instead. hasSlicing and isRandomAccessRange likely would have reqired
opDollar years ago, but it wasn't there initially, and adding it later would
have broken too much code.

With the current language, for $ to work with a user-defined type for either
indexing or slicing, that type must implement opDollar, because that's what
the language replaces $ with. There is no other way to have $.

It has been suggested in the past that we make it so that the language
treats length as opDollar if it's present and opDollar is not present, which
would make it so that all random-access ranges automatically got a working
$. However, it has also been brought up that there are types which have both
indexing and length but where it would be inappropriate for length to be
treated as opDollar - the prime example being hash tables / maps where $
wouldn't make any sense at all.

So, barring a language change, the most straightforward way for a finite
range to implement opDollar is just to alias length to opDollar, and finite
ranges without length shouldn't have opDollar anyway. But infinite ranges
will have to define their own opDollar regardless, since they need one
that's a type that just marks the end when slicing, and it makes no sense
for their opDollar to be size_t, since size_t is not infinite.

The proposal here is basically just to require opDollar with slicing and
indexing like we should have originally - and would have if it hadn't been
forgotten.

- Jonathan M Davis