Towards a better conceptual model of exceptions (Was: Re: The Right Approach to Exceptions)

Tue Feb 21 00:15:48 PST 2012

All of this heated debate has led me to reconsider our whole concept of
exceptions. It seems that we're squabbling over little details in
existing paradigms. But what of the big picture? What *is* an exception
anyway? We all know the textbook definition, but clearly something is
missing since we can't seem to agree how it should be implemented.

DEFINITION

So I'd like to propose this definition: an exception is an abnormal
condition that causes a particular operation not to be completed, *but
which may have one or more ways of recovery*.

I'm not interested in problems with no method of recovery: we might as
well just terminate the program right there and call it a day. The fact
that there's such a thing as try-catch means that we're interested in
problems that we have *some* way of handling.

Before I proceed, I'd like to propose that we temporarily forget about
the current implementation details of exceptions. Let's for the time
being forget about class hierarchies, try-catch, Variant hashes, etc.,
and let's consider the *concept* of an exception.

In discussing what an exception is, we can jump into the nitty-gritty
details, and argue all day about how to handle the particulars of case X
and how to reconcile it with case Y, but I'd like to approach this from
a different angle.  Yes, if we know the nitty-gritty, then we can deal
with it in a nitty-gritty way.  Let it suffice to say that sometimes, we
*want* to get into the nitty-gritty of an exception, so in any
implementation there should be a way to do this.

But let's talk about the generics.  What if we *don't* know the
nitty-gritty? Can we still say something useful about exception handling
*independently* of the details of the exception? If module X calls
module Y and a problem happens, but X doesn't understand the
implementation details of Y or how the problem relates to it, can X
still do something meaningful to handle the problem?

To this end, I decided to categorize exceptions according to their
general characteristics, rather than their particulars. In making this
categorization, I'm not looking for artificial, ivory-tower
idealizations of exception types; I'm looking for categories that would
allow us to handle an unknown exception meaningfully. In other words,
categories for which *recovery methods are available* to handle that
exception. I don't care whether it's an I/O error, a network
error, or an out-of-memory error; what I want to know is, *what can I do
about it*?

TRANSITIVITY

One more point before I get into the categories themselves: in finding
useful categorizations of exceptions, it's useful to find categories
which are *transitive*. And by that I mean, a category C is transitive
if an exception E in this category remains in the same category when the
stack unwinds from Y, where E was first thrown, to X, which called Y.
For example, Andrei's is_transient property constitutes a transitive
category. If X calls Y and Y experiences a transient error, then by
extension X has also experienced a transient error. Therefore, the
is_transient category is transitive.

Why should we care if an exception category is transitive? Because it
allows us to make meaningful decisions far up the call stack from where
the actual problem happened. If an exception category is not transitive,
then as soon as the stack unwinds past the caller of the function which
threw the exception, we can no longer reasonably assume that the
exception still belongs to the same category as before. An illegal input
error, for example, is not necessarily transitive: if X calls Y and Y
calls Z, and Z says "illegal input!", then it means Y passed bad data to
Z, but it doesn't necessarily mean that X passed bad data to Y. It may
be Y's fault that the input to Z was bad. So trying to recover from the
bad data when the stack has unwound to X is not useful: X may not have
any way of changing what Y passes to Z. But if Y merely takes the data
passed to it by X and hands it to Z, then the illegal input *is*
transitive up to X. In *that* case, X can meaningfully attempt a
recovery by fixing the bad data. But if X was the one who created that
data, then X's caller cannot do anything about it. So input errors are
only conditionally transitive -- up to the origin of the input data.

CATEGORIES

Here are the categories I've found so far. I don't claim this is
anywhere near complete, but I'd like to put it on the table so that
y'all can discuss it, and hopefully refine the idea further. Each
category is associated with a list of recovery actions that are
meaningful for that category. Note that it doesn't mean that *every*
exception in that category will have all listed recovery actions
available; some recovery actions aren't possible in all cases. In an
implementation, there would need to be a way to indicate which of the
listed recovery actions are actually possible given a particular
exception.

Also, the recovery actions are deliberately generic. This will be
explained later, but the idea is to let generic code decide on a general
course of action without knowing the details of the actual
implementation, which is determined by the code that triggered the
original condition, or by an intermediate handler midway up the call
stack who knows how to deal with the condition.
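
To make the "category plus list of possible actions" idea concrete, here
is a rough sketch in Python (chosen only for brevity). Every name in it
(Action, Category, MEANINGFUL, Problem) is made up for illustration and is
not part of any existing library or proposal. It only shows one way a
particular problem could advertise which of its category's actions are
actually possible.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    """Generic recovery actions (hypothetical names)."""
    REPAIR_INPUT = auto()
    SKIP_INPUT = auto()
    USE_ALTERNATIVE = auto()
    FREE_RESOURCES = auto()
    RETRY = auto()
    NEW_CREDENTIALS = auto()


class Category(Enum):
    """Generic problem categories (hypothetical names)."""
    INPUT_ERROR = auto()
    COMPONENT_FAILURE = auto()
    CONSISTENCY_ERROR = auto()
    LACK_OF_RESOURCES = auto()
    TRANSIENT_ERROR = auto()
    CREDENTIALS_ERROR = auto()


# Which actions are meaningful at all for each category.
MEANINGFUL = {
    Category.INPUT_ERROR: frozenset({Action.REPAIR_INPUT, Action.SKIP_INPUT}),
    Category.COMPONENT_FAILURE: frozenset({Action.USE_ALTERNATIVE}),
    Category.CONSISTENCY_ERROR: frozenset({Action.USE_ALTERNATIVE}),
    Category.LACK_OF_RESOURCES: frozenset({Action.FREE_RESOURCES}),
    Category.TRANSIENT_ERROR: frozenset({Action.RETRY}),
    Category.CREDENTIALS_ERROR: frozenset({Action.NEW_CREDENTIALS}),
}


@dataclass(frozen=True)
class Problem:
    category: Category
    message: str
    possible: frozenset  # the subset of MEANINGFUL[category] that applies here

    def __post_init__(self):
        # Reject actions that make no sense for this category.
        stray = self.possible - MEANINGFUL[self.category]
        if stray:
            raise ValueError(f"not meaningful for {self.category.name}: {stray}")

    def can(self, action):
        return action in self.possible


# An input error where skipping is possible but repairing is not.
bad_line = Problem(Category.INPUT_ERROR, "unparsable record",
                   frozenset({Action.SKIP_INPUT}))
```

Generic code would then ask bad_line.can(Action.SKIP_INPUT) without
knowing anything else about the record.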

INPUT ERRORS:

Definition: errors that are caused by passing bad input to function X,
such that X doesn't know how to compute a meaningful output or execute a
meaningful operation.

Transitivity: conditional, only up to origin of the input data.

Recovery actions:
- Attempt to repair the input and try the operation again. Only possible
  if there exists a mechanism of repairing the input.

- Skip over the bad input and attempt to continue. Only applicable if
  the input is a list, say, and the program can still function (albeit
  not to the full extent) without the erroneous input. Otherwise, this
  recovery action is meaningless.
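
As a rough illustration of these two actions (a Python sketch, with
parse_numbers, prefer_repair and the digit-stripping "repair" all invented
for this example), the low-level parser below offers "repair" only when it
actually has a way to repair the item, and lets a caller-supplied delegate
pick among the offered options:

```python
def parse_numbers(items, choose):
    """Parse items as ints, consulting `choose` whenever an item is bad."""
    results = []
    for raw in items:
        try:
            results.append(int(raw))
            continue
        except ValueError:
            pass
        # Hypothetical repair mechanism: keep only the ASCII digits.
        digits = "".join(ch for ch in raw if ch in "0123456789")
        options = ["skip"]
        if digits:
            options.insert(0, "repair")  # only offered when it can work
        action = choose(raw, tuple(options))
        if action == "repair" and digits:
            results.append(int(digits))
        elif action == "skip":
            continue
        else:
            raise ValueError(f"bad input {raw!r}")
    return results


def prefer_repair(raw, options):
    """High-level policy: repair when possible, otherwise skip."""
    return "repair" if "repair" in options else "skip"


numbers = parse_numbers(["1", "2x", "oops", "4"], prefer_repair)
```

Here "2x" is repaired and "oops" is skipped, since nothing in it can be
salvaged.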

COMPONENT FAILURE:

Definition: the operation being attempted depends on the normal
functioning of sub-operations X1, X2, ..., but one (or more) of them
isn't functioning normally, so the operation can't be completed.

Transitivity: Yes. If X calls Y and Y calls Z, and one of Z's
subcomponents failed, then from Y's perspective, Z has also failed, and
from X's perspective, so has Y.

Recovery actions:
- Retry the operation using an alternative component, if one is
  available. For example, if X is a DNS resolver and Y is a
  non-responding DNS server, then X could try DNS server Z instead.
  Transitively, if X can't recover (doesn't know an alternative DNS
  server to try), then X's caller can attempt to bypass X and use W
  instead, which looks up a local DNS cache, say.  (Note that the
  high-level code that handles a component failure does not need to know
  the details of how this component swapping is implemented, or exactly
  what is being swapped in/out. It only knows that this is a possible
  course of action.)
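
The DNS example might look roughly like this in Python. This is only a
sketch: the "servers" are plain callables standing in for real DNS
servers, and ComponentFailure, resolve and lookup are names invented here.
Note that it uses ordinary try/except, so the fallback to the cache
happens after the stack has unwound out of resolve:

```python
class ComponentFailure(Exception):
    """A sub-component needed by the operation is not functioning."""


def resolve(name, servers):
    """X: try each DNS server in turn; fail only when none is left."""
    for server in servers:
        try:
            return server(name)
        except ComponentFailure:
            continue  # swap in the next alternative component
    raise ComponentFailure(f"no DNS server could resolve {name}")


def lookup(name, servers, cache):
    """X's caller: if X cannot recover, bypass it and use W (a local cache)."""
    try:
        return resolve(name, servers)
    except ComponentFailure:
        if name in cache:
            return cache[name]
        raise


def dead_server(name):
    raise ComponentFailure("no response")


def live_server(name):
    return "192.0.2.7"  # address from the documentation-only range


address = lookup("example.org", [dead_server, live_server], cache={})
```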

CONSISTENCY ERROR:

Definition: the operation being attempted depends on a suboperation X,
which is operating normally; however, the result returned by X fails a
consistency check. (I'm not sure if this warrants separate treatment
from component failure; they are very similar and the recovery actions
are also very similar. Maybe they can be lumped together as a single
category.)

Examples: Numerical overflow/underflow, which would throw off any
remaining calculations.

Transitivity: Yes. If X calls Y and Y calls Z, and Y discovers that Z
produced an inconsistent result, then by extension Y would have also
produced an inconsistent result had it decided to blindly charge ahead
with the operation.

Recovery actions:
- Retry the operation by using an alternative component, if available.
  For example, a numerical overflow might be repaired by switching to a
  BigNum library for the troublesome part of the computation.
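
A small Python sketch of the overflow case, assuming the standard decimal
module plays the role of the "BigNum library" (exp_with_fallback is a name
made up for this example). Here math.exp detects the overflow itself and
raises, which stands in for the consistency check:

```python
import math
from decimal import Decimal


def exp_with_fallback(x):
    """Compute e**x in floating point, switching components on overflow."""
    try:
        return math.exp(x)  # fast path; raises OverflowError for large x
    except OverflowError:
        # Alternative component: arbitrary-range decimal arithmetic.
        return Decimal(x).exp()


small = exp_with_fallback(1)     # an ordinary float
large = exp_with_fallback(1000)  # too big for a float, so a Decimal
```

The caller has to cope with getting a Decimal back, which is part of what
"switching components" costs in practice.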

LACK OF RESOURCES:

Definition: the operation being attempted would have completed normally,
had there been sufficient resources available, but there aren't, so it
can't continue.

Transitivity: Yes. If X calls Y and Y runs out of resources to finish,
then by extension X doesn't have the resources to finish either.

Recovery actions:
- Free up some resources and try again. This one is debatable, since it
  may not be clear which resources need to be freed up, or whether they
  *can* be freed at all. If it's a full disk, for example, it would be
  unwise to just go and randomly delete files. But some cases can be
  handled, e.g., if memory runs out, trigger the GC. (But presumably the
  GC does this automatically, so this may not be an actual use case that
  needs manual handling.) All in all, this category may not be easy to
  recover from, so it may be of limited utility.

TRANSIENT ERROR:

Definition: the operation depends on component X, which is known to
sometimes fail. Example: a network server may sometimes go down due to
intermittent network problems, timeouts, etc.

Transitivity: Yes. If operation X calls operation Y and Y has a
transient error, then X also has a transient error by extension.

Recovery actions:
- Retry the operation: it may succeed next time.
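
A minimal retry loop in Python, as a sketch only: TransientError, retry and
the flaky operation are all invented for this example, and a real version
would want a sensible delay and backoff policy.

```python
import time


class TransientError(Exception):
    """The operation failed, but may well succeed if tried again."""


def retry(operation, attempts=3, delay=0.0):
    """Call `operation` until it succeeds or `attempts` tries are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of tries: let the error propagate
            time.sleep(delay)


calls = {"count": 0}


def flaky():
    """Hypothetical operation that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("server did not answer")
    return "ok"


result = retry(flaky)
```

Because the category is transitive, the same retry wrapper can sit at any
level of the call stack.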

Sidenote: Here I'd like to say that at first, I was very skeptical about
Andrei's is_transient proposal, because I didn't have the proper context
to understand its utility. I felt that something was missing.  And that
missing something was that is_transient is but a part of a larger
framework of generic exception categories. Without this larger context,
the value of is_transient is not immediately obvious. It seems like just
an arbitrary thing out of the blue. How could it possibly be useful??
But when viewed as part of a larger system, is_transient can be seen to
be an extremely useful concept: it is a *transitive* category, which
means you can do something meaningful with it at any point up the call
stack.

CREDENTIALS ERROR:

Definition: there's no problem with the input, and all subcomponents are
functioning properly, but because of lack of (or improper) credentials,
the operation could not be completed.

Transitivity: Yes(?). Not sure about this one, not because it doesn't
fit the definition, but because it's unclear how to correctly handle the
recovery action. A single operation may consist of many sub-operations,
each requiring a different set of credentials. Just because one of the
sub-operations raises a credentials error doesn't mean the exception
handler knows where to find alternative credentials, or even what kind
of credentials they are.

Recovery actions:
- Retry the operation with different credentials. E.g., prompt user for
  a different password. But I'm unsure if/how this can be generally
  implemented, as described above.

These are all the general categories I found. There may be more.

IMPLEMENTATION

Alright. All of this grand talk about generics and categories is all
good, but how can this actually be implemented in real life?

The try-catch mechanism is not adequate to implement all the recovery
actions described above. As I've said before when discussing what I
called the Lispian model, some of these recovery actions need to happen
*in the context where the exception was thrown*. Once the stack unwinds,
it may not be possible to recover anymore, because the execution context
of the original code is gone.

One peculiarity about Andrei's is_transient is that you *can* re-attempt
the operation after unwinding the stack. Which is what makes it useful
in the current try/catch exception system that we have.

But not all recovery actions can be implemented this way. Some, such as
repairing bad input or trying an alternate component, make no sense after the
stack has unwound: the execution context of the failing component and
its caller is long gone; to try an alternate component would require
painstaking passing of retry information all the way down the function
call chain, polluting normal function parameters with retry parameters
and producing very ugly code. Repair bad input, in particular, *must* be
done before the stack unwinds past the origin of the input, otherwise
it's impossible to correct it.

This is where the Lispian model really shines. To summarize:

1) When we encounter a problem, we raise a Condition (instead of
throwing an exception immediately).

2) Every Condition is associated with a set of recovery actions. These
actions are generic; basically we're mapping each exception category to
a Condition. The raiser of the Condition will specialize each recovery
action with code specific to itself.

3) High-level code may register Handlers (in the form of a delegate) for
particular Conditions.  These registrations are limited by scope; once
the function registering the handler exits, any handlers it registered
are removed from the system. The handler registered closest to the
origin of a Condition has priority over other matching handlers.

4) When a Condition is raised, the condition-handling system first
checks a list of registered condition handlers to see if any handler is
willing to handle the condition. The handler is passed the Condition
with its associated set of recovery options. The handler decides, based
on high-level information, which recovery action to take, and informs
the condition-handling system. The recovery action is then executed *in
the context of the function that raised the Condition*, *without
unwinding the stack*. If no handler is found, or the handler decides to
abort the operation, then the condition-handling system converts the
Condition into an exception and throws it. A function higher up the call
chain may decide to catch this exception and raise a corresponding
Condition, to allow (other) handlers to deal with the situation at the
higher level. If nothing is caught or all attempts to fix the problem
failed, we eventually percolate up the call stack to the top and fail
the program.

Advantages of this system:

- Complex recovery actions are possible, because we don't unwind the
  stack until we decide to abort the operation after all.

- Recovery actions run in the context where failure is first seen,
  thereby taking advantage of the immediate context to recover in a
  specific way.

- High-level code gets to make decisions about which recovery action to
  pursue (via the delegate handler). It gets to do this *without* need
  to know the nitty-gritty of the low-level code; it is given the
  generic problem category and a list of generic recovery actions that
  can be attempted. The low-level code implements various recovery
  options, the high-level code chooses between them.

- If nobody knows how to handle the situation, we unwind the stack, as
  in the traditional try/catch model.

- If an intermediate function up the stack has a way to deal with the
  situation, it can catch the associated exception and raise a Condition
  that has recovery actions *run at its level*. The high-level delegate
  still gets to make decisions, but now the recovery actions are run at
  a higher level than the original locus of the problem. In some cases,
  this is a better position for attempting recovery. E.g., a network
  timeout may be seen at the packet level, but to repair the problem
  requires reconnecting from, say, the HTTP request level, so we need to
  unwind the stack up until that point. This is actually superior to the
  try/catch mechanism, because at the HTTP request level, we don't
  necessarily have enough context to decide what course of action to
  take; but by passing the condition to a higher-level delegate, it can
  make decisions the HTTP module can't make, and the HTTP module can
  correct the problem without unwinding the stack all the way to where
  the delegate was registered.
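
The packet/HTTP example in the last point could be sketched like this in
Python. Everything here is hypothetical (PacketTimeout, send_packet,
http_get, the dict standing in for a connection), and for brevity the
delegate is passed in as a parameter rather than registered with a
condition-handling system:

```python
class PacketTimeout(Exception):
    """Seen at the packet level; cannot be repaired there."""


def send_packet(connection):
    if connection["stale"]:
        raise PacketTimeout("no reply to packet")
    return "200 OK"


def http_get(url, decide, max_reconnects=2):
    """HTTP level: catches the low-level exception and offers recoveries
    that run *here*, while a higher-level delegate makes the decision."""
    connection = {"url": url, "stale": True}  # hypothetical: starts out broken
    reconnects = 0
    while True:
        try:
            return send_packet(connection)
        except PacketTimeout:
            if reconnects < max_reconnects:
                options = ("reconnect", "abort")
            else:
                options = ("abort",)
            choice = decide("transient", options)
            if choice != "reconnect" or "reconnect" not in options:
                raise  # unwind further, as in plain try/catch
            # Recovery runs at the HTTP level, not where decide() lives.
            connection = {"url": url, "stale": False}
            reconnects += 1


response = http_get("http://example.org/", lambda category, options: "reconnect")
```

The stack unwinds only from the packet level to http_get; the code that
supplied the delegate never sees an exception unless it chooses to abort.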

In a previous post, I had a skeletal implementation of this system, but
the major problem was that it was too specialized: every piece of code
that wanted to implement recovery needed to define a specific Condition
with its own set of recovery strategies, leading to reams and reams of
code just to achieve something simple. Furthermore, the high-level
handler needed to know the nitty-gritty low-level details of what each
Condition represented and what options are available to deal with it, so
there was no way to write a *generic* handler that can decide what to do
with conditions whose details it knows nothing about.

But by using generic exception categories, we can finally get rid of
that bloat and still be able to implement problem-specific recovery
strategies. The high-level code need only know which generic category
the Condition belongs to, and based on this it knows which recovery
actions are available. It never needs to know what the details are
(unless it's intended to be a very specific handler dealing with a very
specific condition whose details it knows). The low-level code provides
the implementation of the recovery actions by implementing the generic
interface of that particular category.
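
As a sketch of what such a generic handler might look like in Python (all
names are made up for illustration: PREFERRED, generic_handler, download,
and the "mirrors" that stand in for some low-level component), the handler
below knows only category names and generic action names, while the
low-level code alone knows that "use_alternative" means "try the next
mirror":

```python
# High-level policy: preferred generic actions per generic category.
PREFERRED = {
    "input": ("repair", "skip"),
    "component_failure": ("use_alternative",),
    "consistency": ("use_alternative",),
    "resources": ("free_resources",),
    "transient": ("retry",),
    "credentials": ("new_credentials",),
}


def generic_handler(category, available):
    """Pick the first preferred action that is actually available."""
    for action in PREFERRED.get(category, ()):
        if action in available:
            return action
    return "abort"


def download(mirrors, handler):
    """Low-level code: implements 'use_alternative' as 'next mirror'."""
    remaining = list(mirrors)
    while remaining:
        mirror = remaining.pop(0)
        try:
            return mirror()
        except ConnectionError:
            # Only offer the action if there is something to switch to.
            available = {"use_alternative"} if remaining else set()
            if handler("component_failure", available) != "use_alternative":
                raise
    raise ConnectionError("no mirrors given")


def down():
    raise ConnectionError("mirror is down")


def up():
    return b"data"


payload = download([down, up], generic_handler)
```

The same generic_handler could be handed to any other low-level routine
that reports one of these categories.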

Currently, I'm still unsure whether Conditions and Exceptions should be
unified, or they should be kept separate; deadalnix recommended they be
kept separate, but I'd like to open it for discussion.

Sorry for this super-long post, but I wanted to lay my ideas out in a
coherent fashion so that we can discuss their conceptual aspects without
getting lost in arguing about the details. I hope this is a step in
the right direction toward a better model of exception handling.

T

-- 
Life is complex. It consists of real and imaginary parts. -- YHL