Foreach loop behaviour and manipulation

Tue Jan 7 15:05:03 PST 2014

On Tue, Jan 07, 2014 at 09:38:34PM +0000, Binarydepth wrote:
> 
> >or ... for(count=1;count>0 && count<100 || count>999 &&
> >count<10000 || etc...; count++ )
> 
> mmmm That won't work. It's better separate foreach loops.

Y'know, this code snippet really reminds me of why Jackson Structured
Programming helped me so much. While there are many ways of writing
loops, most ways are wrong (some blatantly so, while others only subtly
so). Writing a correct loop requires that the structure of the code
matches the structure of the problem that it's trying to process.

If you're trying to loop over two distinct ranges, 0 to 100 and 999 to
10000, then conceptually they are two different operations, and
therefore should be mapped to two different loops. Trying to combine
them into one usually creates a mess, and even when you get it right,
the resulting code is fragile, error-prone, and hard to maintain.

I often see code like this:

	bool first_time = false;
	for (i=0; i < n; i++) {
		if (first_time) {
			do_something();
			first_time = true;
		}

		if (i+1 == n) { // last time
			do_something_else();
		}

		do_other_things();
	}

This kind of code shows a poor mapping of code structure to problem
structure, which means it's prone to boundary condition bugs, overlap
bugs, and other sorts of problems, not to mention long-range
interdependencies that makes the code basically impossible to reuse, and
hard to maintain.

Consider what the loop is trying to do, by unrolling it. It looks like
this:

	do_something();
	do_other_things();
	do_other_things();
	do_other_things();
	...
	do_something_else();
	do_other_things();

If you "refactor" this, you see that it follows the structure:

	A --> B (repeat k times) --> C --> B

where A = do_something(), B = do_other_things(), and C =
do_something_else().

So you see that the repeating part in the structure of the problem is
really only with the middle part; A and the final C --> B should be put
*outside* the loop body, like this:

	do_something();
	for (i=1; i+1 < n; i++) {	// N.B.: loop bounds adjusted
		do_other_things();
	}
	do_something_else();
	do_other_things();

Many programmers, esp. those with C/C++ background, cringe when they see
the second call to do_other_things() after the loop body, and they try
various ways of putting it inside the loop body -- especially when
do_other_things() is a big code block. But actually, this way of writing
it is the correct way, because it reflects the structure of the problem
more accurately. Putting everything inside one big loop creates a
structure conflict with the problem, which has a different structure;
this invites people to introduce boolean flags and other such hacks to
make things work correctly. But actually, that's just stitching over the
symptoms of the deeper problem, which is that the structure of the code
fails to correspond with the structure of the problem.

By making the structure of the code match the structure of the problem,
we eliminate the proliferation of boolean flags and convoluted loop
conditions (which are very error-prone and basically impossible to
maintain), and the code begins to speak for itself, because just by
looking at the structure of the code, you know what's the structure of
the problem it's working on. Once the structure is properly sorted out,
then we can worry about other things, like do_other_things() being a big
copy-n-pasted code block -- which then basically suggests the obvious
solution: factor it into a function.

A classic example of poor mapping of code structure to problem structure
is join(): inserting (n-1) delimiters into a list of n items. I almost
always see code like this:

	foreach (i, e; range) {
		result.put(e);
		if (i+1 < range.length)
			result.put(',');
	}

But let's step back for a moment and look at what's the *real* structure
of the problem. Suppose range = [1,2,3,4,5]. Then the desired output is:

	1 , 2 , 3 , 4 , 5

Or, more abstractly, if we use 'e' in a generic sense to refer to any
item:

	e , e , e , e , e , e

which can be "factored" into the form:

	(e ,){(n-1) times} e

Due to the nature of the range API, however, this does not lend itself
to a straightforward implementation (there is no primitive for looping
over (n-1) items in a range). So we consider an alternative
factorization:

	e (, e){(n-1) times}

This, then, suggests that the first element of the range should be
treated specially, which leads to the following code:

	if (range.empty) return; // boundary case

	result.put(range.front);
	range.popFront();

	while (!range.empty) {
		result.put(',');
		result.put(range.front);
		range.popFront();
	}

This eliminates the if-statement with a tricky condition: should it be
(i+1 < range.length) or (i <= range.length) or (i < range.length-1)?
While this particular example is trivial to sort out, you'd be surprised
how often I run into code where the condition is wrongly-written,
resulting in off-by-1 errors. By getting rid of the condition
altogether, we eliminate this source of bugs (not to mention the code is
now free from the arbitrary requirement that the range must have a
.length primitive).  Plus, the code now more accurately reflects the
structure of the output.

Some people will object that the above code looks "more verbose",
"ugly", etc., but the fact of the matter is that it reflects the
structure of the problem more accurately than any of the cuter hacks
that tries to "reduce code bloat" by arbitrarily squashing
non-corresponding structures into a single loop, thus resulting in
poorly-motivated if-statements, needlessly complex loop conditions, and
other ad hoc hacks to patch over the structure conflict. The result is
often fragile and hard to maintain. Not to mention, having loops with
trivial conditions gives the optimizer a far easier time at generating
superior code than a loop with a condition too complex for the optimizer
to understand.

Now, there *are* cases where complex loop conditions are necessary, but
I'd say that 90% of loops don't need this and can be (and should be)
reduced to loops with trivial conditions. So whenever your loop
conditions start acquiring lots of &&'s and ||'s, that's a sign
indicating that your code structure no longer corresponds with the
problem structure, and it's time to rewrite your loops.

</soapbox> :-)

T

-- 
Those who don't understand Unix are condemned to reinvent it, poorly.