Proposed Changes to the Range API for Phobos v3

Thu May 16 22:21:54 UTC 2024

On Thursday, May 16, 2024 11:05:56 AM MDT monkyyy via Digitalmars-d wrote:
> On Thursday, 16 May 2024 at 14:56:55 UTC, Jonathan M Davis wrote:
> > explict range starters for unicode instead of autodecoding
>
> ok, thats *half* the problem; how does `hello 😀
> world`.byUnicode(flag.charIndex).indexOf('w')` count correctly?
> The existing range api discards information that is otherwise
> trivial to have
>
> the old solution of dchars and autodecoding failed; whats your
> proposal for the unicode problem on the *dchar* side of the
> problem where it was believed that dchar would simplify all use
> cases of unicode into simple indexing

The new range API is quite explicitly _not_ trying to do any Unicode
handling of its own. It's doing what has been discussed for years, which is
to leave that up to the programmer and whatever algorithms they choose to
use. Whether it's best to operate at the code unit, code point, or grapheme
level depends on the operation, and issues such as normalization complicate
the situation considerably with regard to any attempt to provide a
one-size-fits-all solution for indexing into graphemes. It's also not
efficient to operate at the grapheme level by default.
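
To make the three levels concrete, here is a minimal sketch using the
existing Phobos v2 functions (byCodeUnit and byDchar from std.utf, byGrapheme
from std.uni). The sample string is my own choice, and the v3 replacements
may well be named differently.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    // 'e' followed by U+0301 (combining acute accent): one visible character.
    string s = "e\u0301";

    // UTF-8 code units: 1 for 'e' plus 2 for U+0301.
    assert(s.byCodeUnit.walkLength == 3);

    // Code points: 'e' and the combining mark.
    assert(s.byDchar.walkLength == 2);

    // Graphemes: the two code points form a single grapheme cluster.
    assert(s.byGrapheme.walkLength == 1);
}
```

Which of the three counts is the "right" one depends entirely on what the
calling code is trying to do.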

The replacement for std.utf will provide the tools to convert from code units
to code points, just like std.utf does now, with the difference being that
the programmer will no longer have to work around the range API functions to
try to prevent decoding from happening automatically.
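
As an illustration of the workaround in question, this sketch shows how the
current (v2) range primitives treat a string as a range of dchar, and how
std.utf.byCodeUnit has to be used to opt out of that. It only describes
today's behavior; how the v3 API will spell any of this is not shown here.

```d
import std.range : walkLength;
import std.range.primitives : ElementType;
import std.traits : Unqual;
import std.utf : byCodeUnit;

// Today, the range primitives present a string as a range of dchar.
static assert(is(ElementType!string == dchar));

// byCodeUnit is the current way to get a range of char instead.
static assert(is(Unqual!(ElementType!(typeof("".byCodeUnit))) == char));

void main()
{
    string s = "caf\u00E9"; // U+00E9 takes two UTF-8 code units

    // Auto-decoded: the string is walked as code points.
    assert(s.walkLength == 4);

    // Explicit opt-out: the string is treated as code units.
    assert(s.byCodeUnit.walkLength == 5);
}
```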

Code that then needs to operate at the grapheme level will need to use the
replacement for std.uni, and that can include a random-access range of
graphemes. But it won't be the default way to operate, because that's
inefficient, and no normalization scheme actually works in all cases, making
selecting a default problematic.

We may choose to include a string type in Phobos which provides an API that
allows you to operate at the grapheme level by default, but dynamic arrays
of code units will be treated as ranges of code units. The problems that
we've had with auto-decoding have stemmed from trying to treat them as
anything else.

So, we can build whatever useful algorithms or types we want on top of the
built-in strings, and we may be able to provide some better solutions than
we currently have for dealing with graphemes, but the built-in strings will
not have any kind of special-casing as part of the range API. Any ranges of
graphemes will be types of their own and will not be the built-in strings
even if they may wrap the built-in strings.
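
A rough sketch of what "a type of its own that wraps a built-in string"
could look like, using today's std.uni.byGrapheme. GraphemeText and its
members are hypothetical names made up for this example; nothing like it has
been proposed for Phobos in this thread.

```d
import std.range : walkLength;
import std.range.primitives : ElementType;
import std.uni : Grapheme, byGrapheme;

// Hypothetical wrapper type, for illustration only.
struct GraphemeText
{
    private string data;

    // Slicing the wrapper yields a range of Grapheme, not of code units.
    auto opSlice() { return data.byGrapheme; }

    // The wrapped string is still available as plain code units.
    string codeUnits() const { return data; }
}

void main()
{
    // 'e' + combining acute accent, followed by 'x'.
    auto t = GraphemeText("e\u0301x");

    static assert(is(ElementType!(typeof(t[])) == Grapheme));
    assert(t[].walkLength == 2);     // two graphemes
    assert(t.codeUnits.length == 4); // four UTF-8 code units
}
```

The built-in string inside the wrapper is untouched; only the wrapper
decides to present graphemes.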

> > 8. $ cannot be used with random-access ranges in generic code,
> > because the range API does not require that a random-access
> > range define opDollar
>
> I believe you should be much much much more spefic
>
> is [min(i,$-1)] supported? is [$..0] supported?
>
> the slicing api is a rabbit hole that needs allot of care

Finite random-access ranges will need to support the same operations with $
that dynamic arrays do, since the whole point here is to access them in the
same manner as dynamic arrays. Operations such as [$ .. 0] never make
sense, so no, that won't be supported. And no additional syntax for $ is
being added. It's purely for indexing and slicing. The current situation
makes it so that we can't use $ at all with ranges in generic code, because
there is no requirement that it be supported, whereas ideally, we'd be able
to use it in the same way that you would with dynamic arrays. So,
isRandomAccessRange will test to make sure that [$ - 1] compiles and returns
the correct type, and hasSlicing will test to make sure that [0 .. $] and
[0 .. $ - 1] compile and return the correct type. It will be an additional
requirement that $ be equivalent to length, but we can't statically test for
that.
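
For illustration, here is a small finite random-access range that defines
opDollar, checked against the existing v2 traits. Window and lastOf are
names invented for this sketch. Note that lastOf has to test for $ itself in
its constraint, which is exactly the extra check that would go away if
isRandomAccessRange guaranteed it.

```d
import std.range.primitives : hasSlicing, isRandomAccessRange;

// Hypothetical finite random-access range over an array of ints.
struct Window
{
    private const(int)[] data;

    @property bool empty() const { return data.length == 0; }
    @property int front() const { return data[0]; }
    @property int back() const { return data[$ - 1]; }
    void popFront() { data = data[1 .. $]; }
    void popBack() { data = data[0 .. $ - 1]; }
    @property Window save() { return this; }
    @property size_t length() const { return data.length; }

    // $ is equivalent to length, as it is for dynamic arrays.
    alias opDollar = length;

    int opIndex(size_t i) const { return data[i]; }
    Window opSlice(size_t lo, size_t hi) { return Window(data[lo .. hi]); }
}

static assert(isRandomAccessRange!Window);
static assert(hasSlicing!Window);

// Generic code today must check separately that $ works.
auto lastOf(R)(R r)
if (isRandomAccessRange!R && is(typeof(r[$ - 1])))
{
    return r[$ - 1];
}

void main()
{
    auto w = Window([1, 2, 3, 4]);
    assert(w[$ - 1] == 4);
    assert(w[0 .. $].length == w.length);
    assert(w[0 .. $ - 1].length == 3);
    assert(lastOf(w) == 4);
    assert(lastOf([1, 2, 3]) == 3); // dynamic arrays work the same way
}
```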

Infinite random-access ranges will not support $ for opIndex, because they
can't do so and still be infinite. They will support it for slicing, but
they won't be able to support doing arithmetic on $, because that makes no
sense with an infinite range. $ will merely be used to indicate that the
slice is supposed to go to the end of the range (and therefore that the
result will be infinite) rather than resulting in a finite slice as would
occur if only indices are used.
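
Here is a sketch of an infinite range that behaves this way, using the same
"token type for $" technique that some existing std.range types use
internally. Naturals and Dollar are names invented for this example, and the
traits used are the current v2 ones.

```d
import std.algorithm.comparison : equal;
import std.range : take;
import std.range.primitives : hasSlicing, isInfinite, isRandomAccessRange;

// Hypothetical infinite random-access range: start, start + 1, ...
struct Naturals
{
    private size_t start;

    // $ is an opaque token, so arithmetic such as $ - 1 cannot compile.
    private static struct Dollar {}
    enum opDollar = Dollar.init;

    enum bool empty = false;
    @property size_t front() const { return start; }
    void popFront() { ++start; }
    @property Naturals save() { return this; }
    size_t opIndex(size_t i) const { return start + i; }

    // [lo .. $] goes to the "end", so the result is still infinite.
    Naturals opSlice(size_t lo, Dollar) const { return Naturals(start + lo); }

    // [lo .. hi] with plain indices yields a finite range.
    auto opSlice(size_t lo, size_t hi) const
    {
        return Naturals(start + lo).take(hi - lo);
    }
}

static assert(isInfinite!Naturals);
static assert(isRandomAccessRange!Naturals);
static assert(hasSlicing!Naturals); // the existing v2 test

void main()
{
    auto n = Naturals(0);

    static assert(is(typeof(n[5 .. $]) == Naturals));
    static assert(!__traits(compiles, n[$ - 1]));
    static assert(!__traits(compiles, n[0 .. $ - 1]));

    assert(n[5 .. $].front == 5);
    assert(n[5 .. $][2] == 7);
    assert(n[2 .. 5].equal([2, 3, 4]));
}
```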

$ is actually already partially tested for with hasSlicing and infinite
ranges where it requires that opSlice return the same range type if opDollar
is used, whereas it's expected to return a finite range if only indices are
used. However, since there's no requirement that $ work at all, it can't
actually be used in generic code even if you know that the range is
infinite. It's just that hasSlicing requires that the result be the same if
$ is implemented to work with slicing, which is therefore kind of a
pointless test as things stand.

- Jonathan M Davis