Program logic bugs vs input/environmental errors
Sean Kelly via Digitalmars-d
digitalmars-d at puremagic.com
Sun Sep 28 21:03:35 PDT 2014
On Monday, 29 September 2014 at 02:57:03 UTC, Walter Bright wrote:
>
> I've said that processes are different, because the scope of
> the effects is limited by the hardware.
>
> If a system with threads that share memory cannot be restarted,
> there are serious problems with the design of it, because a
> crash and the necessary restart are going to happen sooner or
> later, probably sooner.
Right. But if the condition that caused the restart persists,
the process can end up in a cascading restart scenario. Simply
restarting on error isn't necessarily enough.
> I don't believe that the way to get 6 sigma reliability is by
> ignoring errors and hoping. Airplane software is most certainly
> not done that way.
I believe I was arguing the opposite. More to the point, I think
it's necessary to expect undefined behavior to occur and to plan
for it. I think we're on the same page here and just
miscommunicating.
> I recall Toyota got into trouble with their computer controlled
> cars because of their idea of handling of inevitable bugs and
> errors. It was one process that controlled everything. When
> something unexpected went wrong, it kept right on operating,
> any unknown and unintended consequences be damned.
>
> The way to get reliable systems is to design to accommodate
> errors, not pretend they didn't happen, or hope that nothing
> else got affected, etc. In critical software systems, that
> means shut down and restart the offending system, or engage the
> backup.
My point was that it's often more complicated than that. There
have been papers written on self-repairing systems, for example,
and ways to design systems that are inherently durable when it
comes to even internal errors. I think what I'm trying to say is
that simply aborting on error is too brittle in some cases,
because it only deals with one failure mode--memory corruption that is
unlikely to recur. But I've watched always-on systems fall
apart from some unexpected ongoing condition, where simply
restarting doesn't actually help.
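
To make the cascading-restart concern concrete, here is a minimal Python sketch of a supervisor that restarts a failing task but gives up and escalates once failures arrive too quickly. The names `supervise`, `RestartLimitExceeded`, `max_restarts`, and `window_seconds` are hypothetical and do not come from any system discussed in this thread. The sliding-window policy is only one possible way to tell a transient fault from a persistent one.

```python
import time
from collections import deque


class RestartLimitExceeded(Exception):
    """Raised when the task keeps failing faster than the policy allows."""


def supervise(task, max_restarts=3, window_seconds=60.0, clock=time.monotonic):
    """Run task(), restarting it on failure.

    Assumption for illustration: more than max_restarts failures inside
    window_seconds means the fault is persistent, so restarting again
    would only cascade. In that case the error is escalated to the caller
    (which might engage a backup or alert an operator).
    """
    failures = deque()  # timestamps of recent failures
    while True:
        try:
            return task()
        except Exception as exc:
            now = clock()
            failures.append(now)
            # Forget failures that fell out of the sliding window.
            while failures and now - failures[0] > window_seconds:
                failures.popleft()
            if len(failures) > max_restarts:
                raise RestartLimitExceeded(
                    f"{len(failures)} failures within {window_seconds}s"
                ) from exc
            # Otherwise fall through and restart the task.
```

A task that fails once or twice and then succeeds is simply restarted. A task that fails on every attempt is stopped after `max_restarts + 1` tries instead of looping forever, which is the point where a different recovery path has to take over.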