Program logic bugs vs input/environmental errors

Fri Oct 31 13:31:52 PDT 2014

On Fri, Oct 31, 2014 at 08:15:17PM +0000, Kagamin via Digitalmars-d wrote:
> On Thursday, 16 October 2014 at 19:53:42 UTC, Walter Bright wrote:
> >On 10/15/2014 12:19 AM, Kagamin wrote:
> >>Sure, software is one part of an airplane, like a thread is a part
> >>of a process.  When the part fails, you discard it and continue
> >>operation. In software it works by rolling back a failed
> >>transaction. An airplane has some tricks to recover from failures,
> >>but still it's a "no fail" design you argue against: it shuts down
> >>parts one by one when and only when they fail and continues
> >>operation no matter what until nothing works and even then it still
> >>doesn't fail, just does nothing. The airplane example works against
> >>your arguments.
> >
> >This is a serious misunderstanding of what I'm talking about.
> >
> >Again, on an airplane, no way in hell is a software system going to
> >be allowed to continue operating after it has self-detected a bug.
> >Trying to bend the imprecise language I use into meaning the opposite
> >doesn't change that.
> 
> To better depict the big picture as I see it:
> 
> You suggest that a system should shutdown as soon as possible on first
> sign of failure, which can affect the system.
> 
> You provide the hospital in a hurricane example. But you don't praise
> the hospitals, which shutdown on failure, you praise the hospital,
> which continues to operate in face of an unexpected and uncontrollable
> disaster in total contradiction with your suggestion to shutdown ASAP.
> 
> You refer to airplane's ability to not shutdown ASAP and continue
> operation on unexpected failure as if it corresponds to your
> suggestion to shutdown ASAP. This makes no sense, you contradict
> yourself.

You are misrepresenting Walter's position. His whole point was that once
a single component has detected a consistency problem within itself, it
can no longer be trusted to continue operating and therefore must be
shut down. That, in turn, leads to the conclusion that your system design
must include multiple, redundant, independent modules that perform the
same function. *That* is the real answer to system reliability.

Pretending that a failed component can somehow fix itself is a fantasy.
The only way you can be sure you are not making the problem worse is by
having multiple redundant units that can perform each other's function.
Then, when one of the units is known to be malfunctioning, you turn it
off and fall back to one of the other, known-to-be-good components.
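
To make the idea concrete, here is a rough sketch in Python. It is only
an illustration under my own assumptions, not how any real avionics
system is built: the names ConsistencyError, Unit, and RedundantSystem
are made up for this example, and a real design would also need the
units to be genuinely independent (separate hardware, separate
implementations), which a single process cannot show.

```python
class ConsistencyError(Exception):
    """Raised by a unit that has detected a bug in its own state (hypothetical)."""


class Unit:
    """One of several redundant components performing the same function."""

    def __init__(self, name, compute):
        self.name = name
        self.compute = compute
        self.alive = True  # set to False permanently once a self-check fails

    def run(self, x):
        if not self.alive:
            raise RuntimeError(f"{self.name} has been shut down")
        return self.compute(x)


class RedundantSystem:
    """Tries the units in order; a unit that reports a bug is never used again."""

    def __init__(self, units):
        self.units = list(units)

    def run(self, x):
        for unit in self.units:
            if not unit.alive:
                continue
            try:
                return unit.run(x)
            except ConsistencyError:
                # Do not retry or "repair" the unit: it can no longer be
                # trusted, so turn it off and fall back to the next one.
                unit.alive = False
        raise RuntimeError("no working units left")


def buggy_square(x):
    # Stands in for a unit whose internal self-check fails.
    raise ConsistencyError("internal invariant violated")


def good_square(x):
    return x * x


system = RedundantSystem([
    Unit("primary", buggy_square),
    Unit("backup", good_square),
])
result = system.run(7)
```

Note that the failed unit makes no attempt to recover. The decision to
switch over is made outside of it, by code that has no reason to doubt
its own state.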

T

-- 
Error: Keyboard not attached. Press F1 to continue. -- Yoon Ha Lee, CONLANG