[phobos] Input range API; differencing ranges

Sat Jun 18 10:33:39 PDT 2011

Hi everybody. I had been discussing some ideas regarding range API and algorithm design with Andrei. Not to suggest a grand rearchitecting of std.algorithm or anything like that, just exploring the design space. He recommended I bring the conversation here. To get everyone caught up:

- Since the input range concept is extremely limited compared to its more flexible children, I had suggested that in a parallel universe, the API for InputRange could be boiled down to a single function that tests emptiness, reads a value, and pops the front all in lockstep:

	maybe!ElementType get(InputRange r);

Andrei pointed out that having get() return a maybe container would preclude the return value of a mutable range's get() from being an lvalue. Variations such as having get() return a pointer of course have the inverse problem of requiring buffering and lvalue-ness even for ranges that aren't mutable and have no conceptual in-memory representation. However, I think there's a solution: if, instead of returning a pointer, you used a maybeRef!Element object that quacks like maybe!element but provides its "just" value as a reference instead of a value:

	struct maybe(T) { bool isJust; T value; }
	struct maybeRef(T) { T *value; }

	bool isJust(maybe!T m) { return m.isJust; }
	T just(maybe!T m) { assert(m.isJust); return m.value; }

	bool isJust(maybeRef!T m) { return !null(m.value); }
	ref T just(maybeRef!T m) { assert(!null(m.value)); return *m.value; }

Then your foreach loop is:

	for (auto elt = get(r); isJust(elt); elt = get(r)) {
		auto ref x = just(elt);
		...use x...
	}

So get() returns a maybe!Element if it's a value, or maybeRef!element if the element is an lvalue. This way, x is mutable when appropriate, so you only need one foreach implementation for all ranges, and the get() interface enforces single-pass behavior in user code that expects an InputRange. For more capable ranges, empty/front/popFront is just a particular factoring of "get"—though ForwardRange isn't a proper refinement of InputRange in this design, it can at least be considered an instance thereof. On the other hand, having yet another axis of genericization with maybe/maybeRef is annoying. Are there other weaknesses to this design?

- There are some problems for which deriving the difference of two ranges would be useful. For example, in a world where the findSplit* functions aren't provided for you, and you wanted to find the range *before* your needle rather than after, you would want to be able to say:

	frontDifference(range, find(range, needle))

instead of the more awkward

	reversed(findLast(reversed(range)))

or

	auto foundRange = find(range, needle);
	return range[0..(range.size - foundRange.size)];

There are of course problems with the difference operation:

- It's difficult to generically assert that one range is a subset of another and that the difference operation is sensible.
- While the difference operation seems intuitive if you think of a range as two STL iterators mashed together, there are other valid ranges for which the difference operations would be awkward (or at least no better than slicing or take()ing from the range).

One safe, generic alternative to the difference operation I've been thinking about is a splittingRange adapter, that conceptually works like this:

struct splittingRange(R) {
	IteratorType!R originalBegin, begin, end;
	this(R r) {
		originalBegin = begin = r.begin;
		end = r.end;
	}

	bool empty() { return begin == end; }
	ElementType!R front() { return *begin; }
	void popFront() { ++begin; }

	R detachFront() { auto o = originalBegin; originalBegin = begin; return R(o, begin); }
}

The "find the range before the needle" example would then be:

	find(splittingRange(range), needle).detachFront()

Advancing the splittingRange, in addition to popping the front of the original range, also conceptually pushes the endpoint of an internally-kept front range, an operation that should be safe since the union of the two internal ranges always equals the original valid range. The splittingRange provides an additional detachFront() operation that gives you the range you've advanced over, then resets the front range maintained by the splittingRange. Given the public ForwardRange and RandomAccessRange interfaces as-is, splittingRange could be implemented in terms of take() or opSlice(). Iterator-based ranges could overload splittingRange() in a more efficient way, but it wouldn't be a strict interface requirement. You could even generalize splittingRange to InputRanges by having the "front range" be a vector you populate with values as you read them off the "back range", which would allow you to implement things like lines() or words() in a single way that works on both files and in-memory datastructures.

-Joe