Program logic bugs vs input/environmental errors

Sun Sep 28 21:03:35 PDT 2014

On Monday, 29 September 2014 at 02:57:03 UTC, Walter Bright wrote:
>
> I've said that processes are different, because the scope of 
> the effects is limited by the hardware.
>
> If a system with threads that share memory cannot be restarted, 
> there are serious problems with the design of it, because a 
> crash and the necessary restart are going to happen sooner or 
> later, probably sooner.

Right.  But if the condition that caused the restart persists, 
the process can end up in a cascading restart scenario.  Simply 
restarting on error isn't necessarily enough.
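
As a rough illustration of the cascading-restart concern, here is 
a minimal Python sketch of a supervisor that restarts a failed 
worker only while it stays within a restart budget, and escalates 
once the budget is spent.  The names RestartBudget, supervise, 
and on_give_up are hypothetical, not from any real system.  The 
worker callable stands in for launching a separate process, and a 
raised exception stands in for that process crashing.

```python
from collections import deque


class RestartBudget:
    """Allow at most `max_restarts` restarts within any `window` seconds."""

    def __init__(self, max_restarts, window):
        self.max_restarts = max_restarts
        self.window = window
        self._stamps = deque()  # times of recent restarts, oldest first

    def allow(self, now):
        # Forget restarts that have aged out of the sliding window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False  # restarting too often: likely a persistent fault
        self._stamps.append(now)
        return True


def supervise(worker, budget, clock, on_give_up):
    """Run `worker` until it returns normally, restarting it on failure.

    Returns the number of restarts performed.  If the budget is
    exhausted, calls `on_give_up` with the last error and stops
    instead of restarting again.
    """
    restarts = 0
    while True:
        try:
            worker()
            return restarts
        except Exception as err:  # stand-in for "the process crashed"
            if not budget.allow(clock()):
                on_give_up(err)  # escalate: alert, fail over, degrade
                return restarts
            restarts += 1
```

What happens in on_give_up is the hard part and is deliberately 
left open here.  The sketch only shows where "restart" stops 
being the answer and some other recovery strategy has to take 
over.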

> I don't believe that the way to get 6 sigma reliability is by 
> ignoring errors and hoping. Airplane software is most certainly 
> not done that way.

I believe I was arguing the opposite.  More to the point, I think 
it's necessary to expect undefined behavior to occur and to plan 
for it.  I think we're on the same page here and just 
miscommunicating.

> I recall Toyota got into trouble with their computer controlled 
> cars because of their idea of handling of inevitable bugs and 
> errors. It was one process that controlled everything. When 
> something unexpected went wrong, it kept right on operating, 
> any unknown and unintended consequences be damned.
>
> The way to get reliable systems is to design to accommodate 
> errors, not pretend they didn't happen, or hope that nothing 
> else got affected, etc. In critical software systems, that 
> means shut down and restart the offending system, or engage the 
> backup.

My point was that it's often more complicated than that.  There 
have been papers written on self-repairing systems, for example, 
and on ways to design systems that stay durable even in the face 
of internal errors.  What I'm trying to say is that simply 
aborting on error is too brittle in some cases, because it only 
deals with one failure mode: memory corruption that is unlikely 
to recur.  I've watched always-on systems fall apart from some 
unexpected ongoing condition, and in those cases simply 
restarting doesn't help.