A Class in Composition

H. S. Teoh hsteoh at quickfur.ath.cx
Wed Aug 28 11:28:41 PDT 2013


On Tue, Aug 27, 2013 at 04:13:24PM +0200, Chris wrote:
> I had a very simple method that would read a text file, parse it and
> create a lexicon of the type string[string].
> 
> public void loadLexicon(ref string[string] lex, string src)
> 
> It was high time I made the method more sophisticated and flexible
> (allowing for comments in the source file, checking for formatting
> errors, etc.). Because neither the comment bit (C-style line comments
> "//") nor the formatting check is too demanding, I decided to do it
> in the existing loop, which looked something like:
> 
> while (file.readln(buf)) {
>   line = to!string(buf);
>   // Do something with line
> }
> 
> Soon, very soon indeed, I ran into all the problems associated with
> loops, which basically boil down to "Where the f.. am I now?".

Yep, this is a typical structure conflict caused by the mismatch between
the lines in a file and the structures they represent. :)


> So I said to myself that component programming was the perfect fit for
> this kind of problem. I rearranged the program and now I have one line
> in the function that does the job:
> 
> public void loadLexicon(ref string[string] lex, string src) {
>   // ...
>   // arr is of type string[] and holds the lines of the source file.
>   auto dictionary = Lexicon(); // The output range
>   lex = arr.byCommentFilter().byEntry().copy(dictionary).lexicon;
> }
> 
> It's very nice indeed. The beauty of it is not only that I now have
> a nice, sequentially ordered "one-liner"; outsourcing the checks and
> filters also freed my head from the loop logic, which helped me to
> focus on the respective algorithms. This led to a leaner and cleaner
> implementation of each algorithm, because there are no dependencies
> on loop conditions or the position in the loop.

I usually separate out the loop body into smaller functions that handle
each part of the parsing. Since code is the most straightforward when
its structure matches that of the data, this usually means the outer
loop no longer iterates over lines, but over larger, logical structures
(e.g., an entry in your lexicon). That way, your code becomes something
like:

	void loadLexicon(...) {
		auto dictionary = Lexicon();
		while (!input.empty) {
			dictionary.put(parseEntry(input, ...));
		}
	}

	Entry parseEntry(...) {
		skipDelimitingBlankLines(...);
		auto key = parseHeadword(...);
		auto value = parseBody(...);
		return Entry(key, value);
	}

	Key parseHeadword(...) { ... }

	Value parseBody(...) { ... }

This way, your code's structure has a 1-to-1 correspondence with the
format of your input, which makes it far more readable and less
bug-prone.

Of course, we can then take this to the next level, which is to
encapsulate each of these functions into a range-based component, and so
we end up with your nice one-line function. :)
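
For example, byEntry could be a thin wrapper around parseEntry,
roughly like the following sketch (untested, reusing the hypothetical
parseEntry/Entry from above, and assuming std.range is imported for
isInputRange):

	// Sketch: turn parseEntry into a lazy input range of Entry values.
	auto byEntry(R)(R input)
		if (isInputRange!R)
	{
		static struct ByEntry {
			R input;
			Entry current;
			bool done;

			this(R input) {
				this.input = input;
				advance();	// prime the first entry
			}

			@property bool empty() { return done; }
			@property Entry front() { return current; }
			void popFront() { advance(); }

			private void advance() {
				if (input.empty)
					done = true;	// no more entries
				else
					current = parseEntry(input); // consume one entry's lines
			}
		}
		return ByEntry(input);
	}

Then the pipeline in your one-liner,
arr.byCommentFilter().byEntry().copy(dictionary), falls out naturally.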


> I could easily remove the comment filter if the deployment version
> of the program ships with tidied-up source files, or I could add a
> new component if necessary. In a loop, however trivial it may appear
> at first glance, it would not be so simple to add or remove parts of
> the logic.

Yeah, when loops get beyond the simplest incarnations, their complexity
quickly explodes into something unmanageable, usually resulting in tight
coupling between distant parts of the code. That makes it very hard (or
outright impossible) to add/remove parts.


> One drawback is that using ranges creates some overheads and code
> duplication. But the neatness of it is amazing, and I daresay that a
> lot of bugs are hidden in loops simply because the loop logic
> distracts and confuses the programmer and bugs finally find
> loopholes they can slip through.

Overly complex loops make a programmer go loopy and bugs crawl out from
loopholes... ;-)

As for overheads, I wonder if, given enough experience over time, we can
one day identify common patterns that the compiler can take advantage of
and optimize into more efficient code. For example, if the compiler
recognizes a linear composition of range-based components, it could
transform them into optimized nested loops that eliminate some of the
overhead involved. I'm not sure how feasible this is with the current
DMD implementation, though. But it's certainly something to keep in mind
for the future.
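
Just to illustrate the kind of transformation I have in mind, take a
pipeline like the first function below; the hoped-for lowering is
roughly the hand-fused second one (a contrived sketch of the idea,
not a claim about what any optimizer actually emits):

	import std.algorithm : filter, map;

	// Composed range pipeline:
	int sumOfDoubledPositives(int[] data) {
		int sum = 0;
		foreach (x; data.filter!(a => a > 0).map!(a => a * 2))
			sum += x;
		return sum;
	}

	// Hand-fused equivalent, with no intermediate range structs
	// on the hot path:
	int sumOfDoubledPositivesFused(int[] data) {
		int sum = 0;
		foreach (a; data) {
			if (a > 0)
				sum += a * 2;
		}
		return sum;
	}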


> Another drawback is that ranges demand a lot of boilerplate code. If
> you implement them properly, there is a lot of code you have to
> write over and over again, such as
> 
> if (isInputRange!Range && isInputRange!(ElementType!Range) &&
>         is(ElementType!(ElementType!Range) == MyType))
> 
> I don't know if this could possibly be reduced. Or maybe it's just
> my lack of experience.

You can encapsulate this into a template:

	template isRoR(R, ElemType) {
		// Eponymous member, so isRoR!(R, T) evaluates directly to a bool.
		enum isRoR = isInputRange!R &&
			isInputRange!(ElementType!R) &&
			is(ElementType!(ElementType!R) == ElemType);
	}

	// Or, once 2.064 is out, with the enum template shorthand:
	enum isRoR(R, ElemType) = isInputRange!R &&
			isInputRange!(ElementType!R) &&
			is(ElementType!(ElementType!R) == ElemType);

	// Now the sig constraint is much more readable and easier to
	// type.
	auto myFunc(R)(R range)
		if (isRoR!(R, MyType))
	{ ... }

Another source of boilerplate is repeatedly defining .empty, .front,
and .popFront. Every single input range, unless it can be defined as a
composition of existing Phobos ranges, must have this boilerplate:

	struct MyRange(T) {
		@property bool empty() { ... }
		@property T front() { ... }
		void popFront() { ... }
	}

And once you need a forward range, you also need to add:

	@property typeof(this) save() { ... }

There might be a way to reduce the amount of boilerplate by using mixins
or factory functions, though.
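
For example, something along these lines might work (a rough,
untested sketch; InputRangeBoilerplate and nextItem are names I just
made up):

	import std.typecons : Nullable;

	// The range author only writes nextItem(), which returns the next
	// element, or a null Nullable when the range is exhausted; the
	// mixin generates the standard empty/front/popFront on top of it.
	mixin template InputRangeBoilerplate(T)
	{
		private Nullable!T _current;
		private bool _primed;

		private void prime() {
			if (!_primed) { _current = nextItem(); _primed = true; }
		}

		@property bool empty() { prime(); return _current.isNull; }
		@property T front() { prime(); return _current.get; }
		void popFront() { prime(); _current = nextItem(); }
	}

	// Example: squares below a limit, with no hand-written boilerplate.
	struct Squares
	{
		int n, limit;

		Nullable!int nextItem() {
			auto sq = n * n;
			if (sq >= limit) return Nullable!int.init;
			++n;
			return Nullable!int(sq);
		}

		mixin InputRangeBoilerplate!int;
	}

With that, foreach (sq; Squares(0, 50)) just works, since foreach only
needs empty/front/popFront; whether the extra Nullable state is worth
it is another question, of course.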


T

-- 
When solving a problem, take care that you do not become part of the problem.

