Tricky semantics of ranges & potentially numerous Phobos bugs

Tue Oct 16 12:07:01 PDT 2012

On Tue, Oct 16, 2012 at 07:28:38PM +0200, Jonathan M Davis wrote:
> On Tuesday, October 16, 2012 10:13:57 H. S. Teoh wrote:
[...]
> ByLine is perfectly valid range insofar as you realize that it's
> likely to go completely south if you use it in any way that could
> involve keeping front around after popFront has been called, which
> means that anything which relies on keeping front around isn't going
> to work. So, it's a range, but it's essentially an unsafe one (though
> I'm not sure that it's an un- at safe one).
> 
> So, it's fine that ByLine is a range as long as we're willing to put
> up with it not working with a lot of range-based functions because of
> its abnormal behavior. But I don't think that it's at all reasonable
> for range-based functions in general to not be able to rely on front
> returning the same type every time or on its value disappearing or
> becoming invalid in some way after a call to popFront. That's
> completely untenable IMHO.

I think a better solution is to somehow differentiate between ranges
whose .front value is permanent (arrays), and ranges whose .front values
are transient (e.g. byLine, but we shouldn't single it out because there
are other occasions where you may *want* to have a range that doesn't
call .dup every time -- say a range over all permutations of an array
that modifies the array in-place).

Perhaps mark ranges with an .isTransient property (which can be an enum
so it doesn't waste any space), and range-based functions that require
non-transient ranges will refuse to work with a transient range. Or
alternatively, switch to a different implementation that takes
transience into account (which may be slower, etc., but the important
thing is to do it right -- after all, D's motto is safe by default,
unsafe if you ask for it).

> Ranges _can_ define semantics which violate that, but they have to
> make it clear that they do so that programmers using them realize that
> they may not work right with a lot of range-based functions (which
> potentially makes it so that it they really shouldn't have been ranges
> in the first place).

I think this approach of what amounts to guessing which ranges are
safe/unsafe with which functions is what's untenable. We all know that
people don't read documentation, or at least, not thoroughly. Something
like this is too easy to miss, and bugs will slip into code unnoticed
for a long time, and then explode in your face. It's unsafe by default,
which goes against D's philosophy.

I think the problem is better addressed by an .isTransient property that
can be selected by templates at compile-time.  I agree that requiring
*every* range function to expect .front to mutate upon calling
.popFront() is unreasonable, but *some* functions *can* be written in a
way that they will still work. Marking ranges with transient .front
values will allow functions that *can* take transient ranges do so,
while leaving other functions accepting only non-transient ranges. That
way, if somebody tries to pass a transient range to a template that only
works with a non-transient range, they will know at compile-time instead
of getting a nasty surprise later on.

> > So what is (or should be) the semantic design of ranges? Let's work
> > out a precise definition so that we have something to build on.
> 
> As far as front (or back) goes, range-based functions should be able
> to rely on
> 
> 1. front returning the exact same value on every call until popFront
> has been called (though there's no guarantee that front won't have to
> be recalculated on each call, so assigning the result of front to a
> local variable is advisable for efficiency if front must be used
> multiple times before a call to popFront).
> 
> 2. the result of front continuing to be valid and unchanged after
> popFront has been called if it was assigned to a variable.
> 
> Any range is free to violate this, but because range-based functions
> are free to rely on it, such ranges risk not working correctly with
> many range-based functions and must be used with caution.
[...]

I don't like the idea that some ranges _may_ or _may not_ work with some
functions. That's just too unreliable, and it's too easy to make
mistakes. Let's put in an isTransient property on the unsafe ranges and
let the templates' signature constraints enforce passing the right kind
of range.

On Tue, Oct 16, 2012 at 10:58:23AM -0700, David Gileadi wrote:
[...]
> As a D dabbler, it seems wrong to me that a built-in range like
> byLine, which is often used in example code, should have weird side
> effects with the built-in methods in std.algorithm.  IMO this needs to
> be fixed one way or another, and a mere documentation change is likely
> not enough for casual D users.

I agree. We should add an isTransient property to at least the built-in
ranges in Phobos, so that at least those ranges will always work
properly (or complain loudly at compile-time if passed to the wrong
function).

User-defined ranges need to be marked too, but that is up to the user. I
don't think this part can be statically checked (unless you want to
enforce it using the type system by making non-transient ranges return
immutable from .front -- but that may be impractical in some cases). So
this has to be documented, and users will have the option of marking
their own ranges and get the benefit of compile-time compatibility
checks. But at the very least, the Phobos ranges must always work
correctly, or refuse to work at compile-time. Leaving things up to
convention (via documentation) is not a good idea.

T

-- 
ASCII stupid question, getty stupid ANSI.