Proposed Changes to the Range API for Phobos v3
Jonathan M Davis
newsgroup.d at jmdavisprog.com
Sun May 19 04:15:00 UTC 2024
On Saturday, May 18, 2024 8:26:18 AM MDT H. S. Teoh via Digitalmars-d wrote:
> On Thu, May 16, 2024 at 08:56:55AM -0600, Jonathan M Davis via Digitalmars-d
> wrote: [...]
>
> > 1. The easy one is that the range API functions for dynamic arrays will
> > not
> > treat arrays of characters as special. A dynamic array of char will be a
> > range of char, and a dynamic array of wchar will be a range of wchar.
> >
> > Any code that needs to decode will need to use the phobos v3 replacement
> > for std.utf's decode or decodeFront - or use foreach - to decode the code
> > units to code points (and if it needs to switch encodings, then there
> > will be whatever versions of byUTF, byChar, etc. that the replacement for
> > std.utf will have).
>
> I thought we already have this? std.string.byRepresentation,
> std.uni.byCodePoint, std.uni.byGrapheme already fill this need.
Yes, but Phobos v3 will need its own solutions for those - even if it's
simply porting over the existing ones and making them work with the new
range API - because nothing in Phobos v3 is going to depend on Phobos v2 at
all. Most of Phobos v3 is likely to be a cleaned up version of what's in
Phobos v2 rather than something completely different (though in some cases,
there will be more drastic changes), but Phobos v3 is intended to ultimately
replace Phobos v2 so that you won't need to use it for anything any longer
(much as v2 will stick around long term so that old code continues to
compile).
The point here is that because auto-decoding will be gone, you'll need to
explicitly call the functions for decoding if you want decoding, and if you
want graphemes or some other conversion, you'll need to call the appropriate
functions for that. The new range API itself does not deal with them at all.
We do already have such functions for Phobos v2, but anyone using the range
API from Phobos v3 will be calling the Phobos v3 versions of those
functions.
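To illustrate the distinction being made here, a small sketch using the existing Phobos v2 names (the v3 equivalents may well be named differently): once auto-decoding is gone, iterating a string as a range gives you code units, and decoding to code points has to be requested explicitly, either via a typed foreach or via an adapter such as byUTF.

```d
import std.range.primitives : walkLength;
import std.utf : byUTF;

void main()
{
    string s = "héllo"; // 6 UTF-8 code units, 5 code points

    // Without auto-decoding, a string is simply a range of char
    // (UTF-8 code units).
    size_t units;
    foreach(char c; s)
        ++units;
    assert(units == 6);

    // Decoding to code points must be requested explicitly,
    // either with a typed foreach...
    size_t points;
    foreach(dchar d; s)
        ++points;
    assert(points == 5);

    // ...or with a range adapter such as byUTF.
    assert(s.byUTF!dchar.walkLength == 5);
}
```

Note that foreach with an explicit element type has never auto-decoded, so the example behaves the same today; it's the range API functions (front, popFront, etc.) whose behavior changes in v3.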
>
> [...]
>
> > However, with infinite ranges, there is no such solution. If they
> > cannot be default-initialized, then they either can't be ranges, or
> > they would have to be finite ranges which would just never be empty if
> > they're constructed at runtime (while doing something like the flag
> > trick to make their init value empty). And it's certainly true that
> > the range API doesn't (and can't) guarantee that finite ranges are
> > truly finite, but it's still better if we can define infinite ranges
> > that need to be constructed at runtime as infinite ranges, since then
> > we can get the normal benefits that come from statically knowing that
> > a range is infinite.
>
> Infinite ranges also have the peculiarity that slicing may create a
> finite range, i.e., the underlying type changes. That's another wrinkle
> to deal with.
Yes, though I think that the rule I've gone with avoids any real
complications with slicing: both finite and infinite ranges are allowed to
disable default initialization, but if they don't disable it, their init
value must be valid (and for finite ranges, init must additionally be
empty).
Any infinite range with slicing will have to define a finite range that
follows all of the rules for finite ranges, which would include the
requirements for initialization, but since the programmer has the full
freedom to implement it like they would any other finite range, I don't
think there's really anything different about such ranges. Worst case, it's
similar to a finite range that needs to be constructed at runtime to really
function properly (since those can still support default initialization by
adding a flag to indicate whether they were default-initialized or not and
then just have empty be true if they were).
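The flag trick described above can be sketched as follows (RuntimeRange is a hypothetical type, not anything proposed for Phobos): a finite range that needs runtime construction can still support default initialization by making its init value empty.

```d
// A finite range whose init value is valid and empty, even though it
// only does useful work when constructed at runtime.
struct RuntimeRange
{
    private int[] _data;
    private bool _constructed; // false for RuntimeRange.init

    this(int[] data)
    {
        _data = data;
        _constructed = true;
    }

    // A default-initialized instance is simply empty.
    @property bool empty() const { return !_constructed || _data.length == 0; }
    @property int front() const { return _data[0]; }
    void popFront() { _data = _data[1 .. $]; }
}

void main()
{
    RuntimeRange r; // default-initialized: valid and empty
    assert(r.empty);

    auto r2 = RuntimeRange([1, 2, 3]);
    assert(!r2.empty && r2.front == 1);
}
```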
Now, if we were to require that all ranges be default-initializable, since
some infinite ranges can't work that way, those ranges would then have to be
implemented as finite ranges that were never actually empty, and since
slicing has to return the same type for finite ranges, they would end up
with a more complex implementation, because the range would presumably have
to check internally whether it was the "infinite" version or the truly
finite version in portions of its code. So, that approach would be uglier
because of that, but it could still be implemented.
If we required that finite ranges be default-initializable but allowed
infinite ranges to disable default initialization, then an infinite range
with slicing which disabled default initialization _would_ have to be able
to return a finite range which could be default-initialized, and that would
then require doing something like using a Nullable to contain the infinite
range that it would presumably be wrapping so that the finite range could be
default-initialized (as well as being able to determine that it was empty
when the infinite range hadn't been initialized). So, that would also be
more complicated.
However, in both of those cases, it _would_ still be possible to correctly
implement the finite range. It would just be annoying, whereas in the first
case, it would be impossible to implement the infinite range as an infinite
range if it can't be default-initialized and work correctly. So, that's
definitely the larger issue.
In any case, going with the rule where we allow ranges to disable default
initialization but require that their init value be valid if they can be
default-initialized (and require that the init value be empty for finite
ranges) avoids all of those complications. It does have the downside of
generic range-based code having to deal with the possibility of ranges not
being default-initializable, but that is a general problem with generic code
rather than a range-specific one, and the solutions for it are not
range-specific.
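For comparison, a sketch of the case the chosen rule enables (Repeater is a hypothetical type): an infinite range that cannot work when default-initialized simply disables default construction outright, rather than masquerading as a never-empty finite range.

```d
// An infinite range with no valid init value, so it forbids default
// initialization, which the proposed rules allow.
struct Repeater
{
    private int _value;

    @disable this(); // no valid init value, so forbid default init

    this(int value) { _value = value; }

    enum empty = false; // statically known to be infinite
    @property int front() const { return _value; }
    void popFront() {}
}

void main()
{
    // Repeater r;          // error: default construction is disabled
    auto r = Repeater(42);
    assert(!r.empty && r.front == 42);
    r.popFront();
    assert(r.front == 42); // infinite: never runs out
}
```

The `enum empty = false;` member is what lets generic code statically detect that the range is infinite.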
>
> [...]
>
> > 4. All ranges must be either dynamic arrays or structs (and not
> > pointers to structs either).
>
> This is not necessarily an ideal solution. In my own code I've often
> had to iterate over forward ranges via sub functions, where I expect the
> iteration state after returning from the sub function to be retained.
> There are two ways to do this:
>
> 1) Have two versions of every iteration function, one taking the range
> by value (with implicit saving of current iteration state), the other
> taking the range by reference (retain changes to iteration state upon
> return), which leads to a lot of code duplication; or:
>
> 2) Have a single version of each function and pass the range by
> reference, usually by passing a pointer to it (since the current API
> would transparently treat the pointer as the range itself).
>
> Prohibiting pointers eliminates option (2), and leaves me with the
> non-ideal situation (1) where I need lots of code duplication.
>
> Although, come to think of it, we could have a .byRef range wrapper that
> encapsulates a pointer to the range so that changes to iteration state
> would be preserved. But then it begs the question, why not just allow
> pointers in the first place? Why require jumping through extra hoops?
The short answer to this is that we cannot allow forward ranges to be
reference types and get rid of save, and the fact that save has been part of
the range API has proven to be a big mistake. Along with the removal of
auto-decoding, it's the #1 change that we've talked about wanting to make for
years now. I wrote a really long reply explaining a bunch of details about
why locking down the copy semantics matters and then deleted it,
because it was getting way too long, but ultimately, the fact of the matter
is that if we want to get rid of save, we have no choice but to require that
copying forward ranges results in an independent copy. If you can make that
work with some sort of wrapper type without violating the range API, then
have at it (though my experience with RefRange is that it was a big mistake
on my part and that attempting such semantics is too error-prone). Either
way, save needs to go.
And the reality of the matter is that any code that currently passes a range
to a function by value and then does _anything_ with that range afterwards
is relying both on the copy semantics of that particular range type and on
the current implementation of the function being called. Generic code
cannot rely on the copy semantics of ranges, and if the function were going
to make any promises about the state of the range afterwards, it would have
taken it by ref. So, unless you control all of the code in question, you're
relying on unspecified behavior and are just lucky that it happens to work
right now. And if you are in full control of the code, then you can do
whatever you want regardless of what the range API says, though Phobos
functions will be statically enforcing as much of the range API as they can,
so it's true you won't be able to do stuff like use a reference type range
with any Phobos code.
But in order to get rid of save and lock down the copy and assignment semantics
of ranges so that generic code can actually rely on those semantics,
reference type ranges have to be banned - or they'll have to be non-copyable
ranges. We _could_ make it so that basic input ranges are required to be
reference types instead of non-copyable, but that creates a different set of
issues, and it wouldn't help you if you want forward ranges which are
reference types. Those simply aren't possible if we get rid of save unless
we force all forward ranges to be reference types, which would mean having
to wrap arrays to make them be ranges as well as significantly increasing
how much D code would need to allocate on the heap.
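A sketch of what the two calling conventions look like under value-semantic forward ranges (the function names here are made up for illustration): copying a range yields an independent copy, which is the behavior that replaces save, while taking the range by ref is how a function deliberately lets the caller see the consumed state.

```d
import std.range.primitives;

// Takes a copy; the caller's range is unaffected.
size_t countByValue(R)(R r) if (isInputRange!R)
{
    size_t n;
    for(; !r.empty; r.popFront())
        ++n;
    return n;
}

// Takes the range by ref; the caller sees it consumed.
size_t countByRef(R)(ref R r) if (isInputRange!R)
{
    size_t n;
    for(; !r.empty; r.popFront())
        ++n;
    return n;
}

void main()
{
    int[] a = [1, 2, 3];

    auto r1 = a[];
    assert(countByValue(r1) == 3);
    assert(!r1.empty); // r1 was copied; still intact

    auto r2 = a[];
    assert(countByRef(r2) == 3);
    assert(r2.empty); // consumed through the ref
}
```

With reference-type ranges, countByValue would silently consume the caller's range too, which is exactly the ambiguity that locking down the copy semantics is meant to eliminate.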
> > 11. Finite random-access ranges are required to implement opDollar,
> > and their opIndex must work with $. Similarly, any ranges which
> > implement slicing must implement opDollar, and slicing must work with
> > $.
> >
> > In most cases, this will just be an alias to length, but barring a
> > language change that automatically treats length as opDollar (which
> > has been discussed before but has never materialized and is somewhat
> > controversial given types where it wouldn't make sense to treat length
> > as opDollar), we have to require that opDollar be defined, or generic
> > code won't be able to use $ with indexing or slicing. We probably
> > would have required it years ago except that it would have broken code
> > to add the requirement.
>
> [...]
>
> Will this also require implementing arithmetic operators on the return
> type of opDollar? Otherwise things like r[0 .. $-1] still wouldn't
> work correctly. Or r[0 .. complicatedMathFunc(($-1)/2)].
I go into detail on this topic in my reply to Steven, but the short
answer is that yes, they'll be required for opDollar with finite ranges
(they make no sense for infinite ranges), but it's probably going to make
sense to just require that opDollar evaluate to size_t, since length already
has to be size_t, and I don't think that we want to change that.
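A sketch of what requirement 11 looks like in practice (Squares is a hypothetical type, and the other primitives a full random-access range would need are elided): in the common case, opDollar is just an alias for length, which makes $ evaluate to a size_t and lets expressions like r[$ - 1] work in generic code.

```d
// A finite random-access range whose opDollar aliases length.
struct Squares
{
    private size_t _len;
    private size_t _i;

    this(size_t len) { _len = len; }

    @property bool empty() const { return _i >= _len; }
    @property size_t length() const { return _len - _i; }
    alias opDollar = length; // $ evaluates to length, a size_t

    size_t front() const { return opIndex(0); }
    void popFront() { ++_i; }
    size_t opIndex(size_t i) const { return (_i + i) * (_i + i); }
}

void main()
{
    auto r = Squares(5); // 0, 1, 4, 9, 16
    assert(r.length == 5);
    assert(r[$ - 1] == 16);
    r.popFront();
    assert(r.length == 4 && r[$ - 1] == 16);
}
```

Because opDollar here is a size_t, arithmetic like $ - 1 works without any extra operator overloads, which is the point of just requiring it to evaluate to size_t.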
- Jonathan M Davis