Program logic bugs vs input/environmental errors

Sun Sep 28 22:15:13 PDT 2014

On 9/28/2014 9:03 PM, Sean Kelly wrote:
> On Monday, 29 September 2014 at 02:57:03 UTC, Walter Bright wrote:
> Right.  But if the condition that caused the restart persists, the process can
> end up in a cascading restart scenario.  Simply restarting on error isn't
> necessarily enough.

When it isn't enough, use the "engage the backup" technique.
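
As a rough sketch of what "engage the backup" could look like (the names
run_with_backup, primary, backup and max_restarts are hypothetical and do not
come from this thread): restart the primary a bounded number of times, and if
the fault persists, stop restarting and hand over to an independent backup.
The callables here stand in for what would normally be separate processes or
separate machines.

```python
from typing import Callable


def run_with_backup(primary: Callable[[], None],
                    backup: Callable[[], None],
                    max_restarts: int = 3) -> str:
    """Run primary; on failure restart it up to max_restarts times in total,
    then engage the backup instead of restarting forever.

    Returns "primary" or "backup" to say which one finished the job.
    """
    for attempt in range(1, max_restarts + 1):
        try:
            # Assumption: in a real system each attempt is a fresh process,
            # so no state from the failed run survives into the next one.
            primary()
            return "primary"
        except Exception as exc:
            print(f"primary failed (attempt {attempt} of {max_restarts}): {exc}")
    # The condition persists across restarts, so a further restart would
    # only continue the cascade. Switch to the backup.
    backup()
    return "backup"
```

The bound on restarts is what breaks the cascading-restart loop; the backup
only helps if it does not share the fault, which is why it should be an
independent implementation rather than a second copy of the same code.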

>> I don't believe that the way to get 6 sigma reliability is by ignoring errors
>> and hoping. Airplane software is most certainly not done that way.
>
> I believe I was arguing the opposite.  More to the point, I think it's necessary
> to expect undefined behavior to occur and to plan for it.  I think we're on the
> same page here and just miscommunicating.

Assuming that the program bug couldn't have affected other threads is relying on
hope. A bug means the program has entered an unknown and unanticipated state.
You cannot know, until after you debug it, what other damage the fault caused,
or what other damage caused the detected fault.

> My point was that it's often more complicated than that.  There have been papers
> written on self-repairing systems, for example, and ways to design systems that
> are inherently durable when it comes to even internal errors.

I confess much skepticism about such things when it comes to software. I do know 
how reliable avionics software is done, and that stuff does work even in the 
face of all kinds of bugs, damage, and errors. I'll be betting my life on that 
tomorrow :-)

Would you bet your life on software that had random divide by 0 bugs in it that 
were just ignored in the hope that they weren't serious? Keep in mind that 
software is rather unique in that a single bit error in a billion bytes can 
render the software utterly demented.
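
To make the distinction in the subject line concrete, here is a minimal
sketch (the functions parse_ratio, scale and handle_request are made up for
illustration): bad input is an expected condition that gets reported and
survived, while a zero divisor arriving where the program's own logic
guarantees it cannot is a bug, and is deliberately left uncaught so the
process stops instead of carrying on in an unknown state.

```python
from typing import Tuple


def parse_ratio(text: str) -> Tuple[int, int]:
    """Parse "n/d" from the outside world. Bad input is an input error."""
    num, sep, den = text.partition("/")
    if not sep:
        raise ValueError("expected the form n/d")
    n, d = int(num), int(den)  # int() raises ValueError on non-numeric text
    if d == 0:
        raise ValueError("denominator must be nonzero")
    return n, d


def scale(n: int, d: int, factor: int) -> int:
    """Internal helper. Callers must already have validated d != 0."""
    if d == 0:
        # Not an input problem: validation upstream was skipped or broken,
        # so the program is in a state nobody anticipated.
        raise AssertionError("logic bug: zero denominator reached scale()")
    return n * factor // d


def handle_request(text: str, factor: int) -> str:
    try:
        n, d = parse_ratio(text)
    except ValueError as exc:
        return f"rejected: {exc}"  # input error: report it and keep running
    # AssertionError is intentionally not caught here; it should end the
    # process so that a supervisor can restart it or engage a backup.
    return str(scale(n, d, factor))
```

For example, handle_request("6/0", 10) is rejected as bad input, whereas
calling scale(1, 0, 1) directly raises AssertionError.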

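Remember the Apollo 11 lunar landing, when the descent computer software started
showing self-detected faults (the 1202 and 1201 program alarms)? The software did
not carry on regardless: it restarted itself and dropped its low-priority work,
while Armstrong flew the final approach by hand. Nobody there was going to bet
his ass that the faults could be ignored. You and I wouldn't, either.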

> I think what I'm
> trying to say is that simply aborting on error is too brittle in some cases,
> because it only deals with one vector--memory corruption that is unlikely to
> reoccur.  But I've watched always-on systems fall apart from some unexpected
> ongoing situation, and simply restarting doesn't actually help.

In such a situation, ignoring the error seems hardly likely to do any better.