Dealing with ranges where front and popFront do the same logic / eager ranges

Wed Oct 17 00:12:13 UTC 2018

On Tue, Oct 16, 2018 at 10:59:50PM +0000, Dennis via Digitalmars-d-learn wrote:
> I've always been curious around the design choice of ranges to make
> front and popFront separate functions, instead of having popFront
> return the front. I suppose it is useful sometimes to be able to
> access front multiple times without having to save it temporarily
> (can't think of a specific example though).

IMO, it is a good design choice, though I have also encountered the
issues you mention myself.  The main advantage is the ability to make
decisions with a 1-element lookahead -- you can examine the next element
without consuming it, which is useful in writing parsers, when the next
token may not be a part of the current parse subtree, so a subtree
parsing routine can just gracefully exit without consuming the next
token (and requiring stupid hacks like unputting the token back so that
another parsing routine can pick it up), and let the upper-level
routines take care of the case where the next token is actually invalid.

> But sometimes doing popFront requires doing the same work as front.
> For example, for strings and autodecoding, both front and popFront
> need to determine the length of the next code point. Now for utf-8
> that is still managable, but I'm making a tokenizer range for a
> programming language. Most of the logic in `front` needs to be
> duplicated for `popFront`! (and `empty` as well)
> 
> So I often end up with designs like:
[...]

Your solution is how I would do it too.  There's no point in duplicating
the work in .front; calling .front multiple times shouldn't recompute
the same thing over and over again. Just do the work once and save it,
then return the same result if the caller calls .front multiple times.

If you're worried about the size of the struct becoming too big, just
add a fake Token type that marks EOF, and check for that in .empty
instead of the `bool empty`.

[...]
> This increases the size of the Range struct, and makes the range one
> element eager. If the first token is invalid, and I use Exceptions,
> then merely constructing the range will immediately throw an
> Exception. I recall reading that constructors throwing exceptions are
> problematic, but I don't have any experience with that. So maybe I
> should add a try-catch in the constructor, and rethrow the Exception
> upon the first 'front' call? That feels hacky.
[...]

It *is* hacky.  I'd just throw the Exception in the ctor, and not bother
with rethrowing it in .front.  I'm not sure what's the reasoning behind
the saying that throwing exceptions in ctors is bad, but exceptions are
exactly the kind of thing designed for handling this sort of situation.
If the parser detects a problem early (i.e., at construction time,
rather than the first call to .front), why not report the problem early
instead of waiting?

Now, there *is* one situation where the first-element-eager behaviour of
ranges isn't desirable, and that is if you're reading live user input
from stdin (e.g., in a REPL), and you want to do something like print a
prompt before the first input line is read.  In this case, the mere act
of constructing the parser would block on input, likely before the code
has a chance to print the prompt.  I've encountered this situation on
multiple occasions, and have found it truly aggravating, esp. if you're
using std.stdio.File.byLine, which has to read a line of input before it
can know the value of .empty, so the mere act of calling .byLine
requires you to print a prompt first, even if you structured your code
differently for better encapsulation.

My eventual solution is to wrap .byLine inside a struct that lazily
constructs the ByLine instance:

	// Implementation of byLinePrompted.
	private struct ByLinePrompted(alias writePrompt, File)
	{
	    private File f;
	    private alias ByLine = typeof(f.byLine());

	    import std.typecons : Nullable;
	    private Nullable!ByLine _impl;
	    private ref ByLine impl()
	    {
		if (_impl.isNull)
		{
		    writePrompt();
		    _impl = f.byLine;
		}
		return _impl;
	    }

	    @property bool empty() { return impl.empty; }
	    @property auto front() { return impl.front; }
	    void popFront()
	    {
		writePrompt();
		impl.popFront();
	    }

	    static assert(isInputRange!(typeof(this)));
	}

	/**
	 * Returns: A lazily-initialized wrapper around File.byLine that
	 * emits a prompt before every line read.
	 *
	 * Params:
	 *  writePrompt = A function invoked each time a prompt is called for.
	 *  f = The file to wrap around.
	 */
	auto byLinePrompted(alias writePrompt, File)(File f)
	    if (is(typeof(writePrompt())))
	{
	    return ByLinePrompted!(writePrompt, File)(f);
	}

Example usage:

	auto input = stdin.byLinePrompted!({
		stdout.write("> "); 
		stdout.flush;
	}).parse;

	while (!input.empty) {
		auto token = input.front;
		...
	}

In essence, we write the prompt before constructing byLine. This also
defers any exceptions from the parser until the first call to .front.
The byLinePrompted wrapper gives us a nice API to hide away all of this
ugly mess.

It does entail a null check every invocation, but the CPU branch
predictor ought to be able to figure out pretty quickly that the null
check will always be false after the first time, so you shouldn't see a
significant performance hit.

T

-- 
Bomb technician: If I'm running, try to keep up.