Program logic bugs vs input/environmental errors

Sun Sep 28 19:57:02 PDT 2014

On 9/28/2014 6:39 PM, Sean Kelly wrote:
> Well... suppose you design a system with redundancy such that an error in a
> specific process isn't enough to bring down the system.  Say it's a quorum
> method or whatever.  In the instance that a process goes crazy, I would argue
> that the system is in an undefined state but a state that it's designed
> specifically to handle, even if that state can't be explicitly defined at design
> time.  Now if enough things go wrong at once the whole system will still fail,
> but it's about building systems with the expectation that errors will occur.
> They may even be logic errors--I think it's kind of irrelevant at that point.
>
> Even a network of communicating processes, one getting in a bad state can
> theoretically poison the entire system and you're often not in a position to
> simply shut down the whole thing and wait for a repairman.  And simply rebooting
> the system if it's a bad sensor that's causing the problem just means a pause
> before another failure cascade.  I think any modern program designed to run
> continuously (increasingly the typical case) must be designed with some degree
> of resiliency or self-healing in mind.  And that means planning for and limiting
> the scope of undefined behavior.

I've said that processes are different, because the scope of a failure's effects 
is limited by the hardware.

If a system with threads that share memory cannot be restarted, there are 
serious problems with the design of it, because a crash and the necessary 
restart are going to happen sooner or later, probably sooner.

I don't believe that the way to get 6 sigma reliability is by ignoring errors 
and hoping. Airplane software is most certainly not done that way.

I recall Toyota got into trouble with their computer-controlled cars because of 
their approach to handling inevitable bugs and errors. It was one process that 
controlled everything. When something unexpected went wrong, it kept right on 
operating, any unknown and unintended consequences be damned.

The way to get reliable systems is to design them to accommodate errors, not to 
pretend they didn't happen or hope that nothing else was affected. In critical 
software systems, that means shutting down and restarting the offending system, 
or engaging the backup.

There's no other way that works.
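
A minimal sketch of the "shut down and restart the offending system, or engage
the backup" policy, in Python for brevity. Everything named here (`supervise`,
`max_restarts`, and the `HEALTHY` and `CRASHING` stand-in commands) is
hypothetical and not taken from any real system. The only point it illustrates
is that the supervisor runs in a separate process from the worker, so a worker
that dies cannot corrupt the supervisor's own state.

```python
import subprocess
import sys

# Hypothetical stand-ins for real worker programs: one exits cleanly,
# the other always dies with exit code 3.
HEALTHY = [sys.executable, "-c", "pass"]
CRASHING = [sys.executable, "-c", "import sys; sys.exit(3)"]


def supervise(primary_cmd, backup_cmd, max_restarts=2):
    """Run primary_cmd in its own process and restart it on abnormal exit.

    After the initial run plus max_restarts restarts have all failed,
    give up on the primary and run backup_cmd once.
    Returns a list of (role, attempt, outcome) tuples describing what happened.
    """
    events = []
    for attempt in range(max_restarts + 1):
        # Each run is a fresh process, so nothing carries over from a crash.
        code = subprocess.run(primary_cmd).returncode
        if code == 0:
            events.append(("primary", attempt, "ok"))
            return events
        events.append(("primary", attempt, f"exit {code}"))

    # The primary kept failing: engage the backup instead of carrying on.
    code = subprocess.run(backup_cmd).returncode
    events.append(("backup", 0, "ok" if code == 0 else f"exit {code}"))
    return events
```

With these stand-ins, `supervise(HEALTHY, HEALTHY)` returns
`[("primary", 0, "ok")]`, while `supervise(CRASHING, HEALTHY, max_restarts=1)`
records two failed primary runs followed by `("backup", 0, "ok")`. A real
supervisor would also need timeouts and a way to detect a worker that is
misbehaving without exiting, which this sketch leaves out.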