Program logic bugs vs input/environmental errors
Walter Bright via Digitalmars-d
digitalmars-d at puremagic.com
Sun Sep 28 22:15:13 PDT 2014
On 9/28/2014 9:03 PM, Sean Kelly wrote:
> On Monday, 29 September 2014 at 02:57:03 UTC, Walter Bright wrote:
> Right. But if the condition that caused the restart persists, the process can
> end up in a cascading restart scenario. Simply restarting on error isn't
> necessarily enough.
When it isn't enough, use the "engage the backup" technique.
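As a rough illustration of "engage the backup" (a minimal sketch, not how any particular avionics system does it): each unit runs in its own process, a unit that exits with a fault is never resumed, and once a small restart budget is used up the supervisor stops retrying and hands over to the next unit. The names run_with_backup, PRIMARY, BACKUP and max_restarts are hypothetical, and a nonzero exit code is assumed to mean the unit detected a fault and aborted.

```python
import subprocess
import sys


def run_with_backup(commands, max_restarts=1):
    """Run each command in its own process; fail over when one keeps failing.

    Returns the index of the first command that exits cleanly, or None
    when every unit has failed.
    """
    for index, command in enumerate(commands):
        # Give the unit a bounded number of fresh starts, so a persistent
        # fault cannot turn into an endless cascade of restarts.
        for _attempt in range(1 + max_restarts):
            result = subprocess.run(command)
            if result.returncode == 0:
                return index
            # Nonzero exit: the unit is presumed faulty. Its process is
            # gone, so none of its possibly corrupted state is reused.
        # Restart budget exhausted: engage the next unit instead.
    return None


# Hypothetical stand-ins: a primary that always aborts, a healthy backup.
PRIMARY = [sys.executable, "-c", "import sys; sys.exit(1)"]
BACKUP = [sys.executable, "-c", "import sys; sys.exit(0)"]

if __name__ == "__main__":
    print("unit that completed:", run_with_backup([PRIMARY, BACKUP]))
```

Running the units as separate processes is the point of the sketch: the backup starts from a known state and shares no memory with the unit that failed.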
>> I don't believe that the way to get 6 sigma reliability is by ignoring errors
>> and hoping. Airplane software is most certainly not done that way.
>
> I believe I was arguing the opposite. More to the point, I think it's necessary
> to expect undefined behavior to occur and to plan for it. I think we're on the
> same page here and just miscommunicating.
Assuming that the program bug couldn't have affected other threads is relying on
hope. Bugs happen when the program goes into an unknown and unanticipated state.
You cannot know, until after you debug it, what other damage the fault caused,
or what other damage caused the detected fault.
> My point was that it's often more complicated than that. There have been papers
> written on self-repairing systems, for example, and ways to design systems that
> are inherently durable when it comes to even internal errors.
I confess much skepticism about such things when it comes to software. I do know
how reliable avionics software is done, and that stuff does work even in the
face of all kinds of bugs, damage, and errors. I'll be betting my life on that
tomorrow :-)
Would you bet your life on software that had random divide by 0 bugs in it that
were just ignored in the hope that they weren't serious? Keep in mind that
software is rather unique in that a single bit error in a billion bytes can
render the software utterly demented.
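A small illustration of how much damage one flipped bit can do (a sketch only; flip_bit is a hypothetical helper, and the value is assumed to be stored as a 64-bit IEEE 754 double): inverting a single exponent bit turns 1.0 into infinity.

```python
import struct


def flip_bit(value, bit):
    """Return `value`, stored as a 64-bit IEEE 754 double, with one bit inverted."""
    # Reinterpret the double as a 64-bit unsigned integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert the chosen bit and reinterpret the result as a double again.
    (damaged,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return damaged


if __name__ == "__main__":
    print(flip_bit(1.0, 62))  # top exponent bit flipped: inf
    print(flip_bit(1.0, 63))  # sign bit flipped: -1.0
```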
Remember the Apollo 11 lunar landing, when the descent computer software started
showing self-detected faults? Armstrong turned it off and landed manually. He
wasn't going to bet his ass that the faults could be ignored. You and I
wouldn't, either.
> I think what I'm
> trying to say is that simply aborting on error is too brittle in some cases,
> because it only deals with one vector--memory corruption that is unlikely to
> reoccur. But I've watched always-on systems fall apart from some unexpected
> ongoing situation, and simply restarting doesn't actually help.
In such a situation, ignoring the error seems hardly likely to do any better.