Mir Random [WIP]

Fri Nov 25 02:31:26 PST 2016

On Friday, November 25, 2016 08:32:07 Joseph Rushton Wakeling via 
Digitalmars-d wrote:
> On Wednesday, 23 November 2016 at 21:19:51 UTC, Jonathan M Davis
>
> wrote:
> > It's quite feasible if you don't calculate front until it's
> > called or popFront is called, and then you cache the result so
> > that subsequent calls to front prior to the call to popFront
> > return the same result, but it creates additional overhead,
> > because then every call to front and popFront has to check
> > whether the value has been calculated yet. And if every range
> > in the chain of ranges has to do that, then that additional
> > overhead is going to add up.
>
> Yes, indeed.  I actually came up with a general purpose solution
> for that a while back via template mixin (although the version in
> the post linked to is not the final version: I updated it to use
> HaraldZealot's suggestion of making the frontImplementation and
> popFrontImplementation the template parameters):
> https://forum.dlang.org/post/mailman.263.1439926667.13986.digitalmars-d@pu
> remagic.com
> > In general, it's just cleaner to either calculate front on
> > every call to front (which doesn't make sense in the case of
> > random number generators) or to calculate the first front upon
> > construction and be done with it.
>
> In some ways I think it's cleaner to not privilege the first
> front value and go for "true" laziness for all front values --
> particularly when dealing with stuff like RandomSample.

It's much messier from an implementation standpoint. If the constructor
checks for empty and caches front if the range it was given was non-empty,
then every call to front can just return front, and every call to popFront
can just pop the element off and do whatever it does to turn the front from
the wrapped range into the front of this range and then cache it. If the
constructor doesn't do that work, then every call to front and every call to
popFront have to check whether they're the first time that either has been
called and _then_ they end up doing all of the same work anyway. So, there's
more code, and it's less efficient. And if every range in the chain does
that, then you're getting a lot of extra checks just to determine whether
it's the first time that front or popFront is being called. Now, if there's
never any caching of the front value, and front always calculates it, then
that might actually be more efficient if front is only ever called once per
element, because you don't get the cost of copying the element to cache it,
but if front gets called multiple times (which happens fairly often, I
think), then the cost is definitely greater.

You're never going to be able to rely on a fully lazy front anyway unless we
specified it as part of the range API that all ranges work that way, and as
it stands, very few ranges do.

> The downside is that `.front` becomes non-const.

Well, realistically, as it stands, ranges are utterly useless with const
anyway. We'd have to have a standard mechanism for getting a tail-const copy
of a range from a const or immutable range, and we don't.

> Sometimes I
> wonder whether it wouldn't have been better to require that
> `popFront()` be called before even the first call to `.front`,
> and place the onus on `popFront()` to handle any special-casing
> of the first `.front` value.

That would be terrible with chaining ranges. It also would be _very_
error-prone. It's also heading into two-phase initialization territory,
which is almost always a bad idea.

> > And I think that the general consensus has been that
> > calculating front in the constructor and popFront and caching
> > works better than calculating it on every call to front, but
> > that doesn't always work (e.g. map), and that caching does
> > cause issues occasionally.
>
> The case I always think of is: what if your input range is
> designed to correspond to sensor observations made at a
> particular time?  This is a case where cacheing the first value
> on construction is very problematic.
>
> For RNGs it's really not such a big deal so long as successive
> variates are independent, but for random algorithms I think the
> laziness matters, since their popFront is essentially
> IO-dependent (in the Haskell-style sense that randomness is IO).

Honestly, I don't think that it really works to have a range where the value
of its elements depend on when you call front or popFront. The API lets you
do it, but you're going to run into plenty of problems if you do unless you
don't care about the frequency at which the values are generated. Part of
the problem here is that ranges and the algorithms that use them are really
designed with arrays in mind. A non-random-access range reduces those
capabilities, and a basic input range doesn't actually require that the
elements be reproducible aside from multiple calls to front giving the same
result, but it's still the case that it's pretty much assumed that a range
is a fixed set of values, and pretty much nothing is concerned with how fast
front or popFront is called aside for efficiency concerns. So, if someone
has a range that calculates a value when front or popFront is called, and
that value depends on when the function is called, I think that they're just
going to have to deal with the consequences of that. It's already pretty
problematic when you have a range that's a basic input range rather than a
forward range.

Another thing to consider here is that those sorts of concerns really only
apply to basic input ranges. As soon as you have a forward range, the values
of the elements need to be completely predictable - or at least completely
reproducible - meaning that aside from caching a potentially infinite number
of elements to guarantee that save works, nothing that's doing something
like grabbing changing values from a sensor is going to be anything but a
basic input range. So, maybe we should do something special with regards to
basic input ranges and how they're treated in order to better deal with
cases like that, but if we went that route, then we pretty much couldn't
treat them like forward ranges without save anymore. In many cases, we'd
have to have separate implementations that avoided things like caching, and
that would become problematic fast.

I think that there always going to be cases where certain things have some
funky corner cases with ranges - especially the further that they get from
being dynamic arrays. We probably need to do a better job of formalizing
some of this stuff to better avoid some of those corner cases, but I think
that we're always going to be able to find something that we can define as a
range that works in many cases but starts doing weird things if we use it on
a large enough range of algorithms.

- Jonathan M Davis