Program logic bugs vs input/environmental errors

Mon Sep 29 12:09:32 PDT 2014

On Monday, 29 September 2014 at 05:15:14 UTC, Walter Bright wrote:
>
> I confess much skepticism about such things when it comes to 
> software. I do know how reliable avionics software is done, and 
> that stuff does work even in the face of all kinds of bugs, 
> damage, and errors. I'll be betting my life on that tomorrow :-)
>
> Would you bet your life on software that had random divide by 0 
> bugs in it that were just ignored in the hope that they weren't 
> serious? Keep in mind that software is rather unique in that a 
> single bit error in a billion bytes can render the software 
> utterly demented.

I'm not saying the errors should be ignored, but rather that
there are other approaches to handling errors besides (or in
addition to) terminating the process.  For me, the single most
important thing is detecting errors as soon as possible so
corrective action can be taken before things go too far south (so
hooray for contracts!).  From there, the proper response depends
on the error detected and the type of system I'm working on.
With persistent stateful systems, for example, even if a restart
occurs, can you assume that the persisted state is valid?  With a
mesh of communicating systems, if one node goes insane, what
impact might it have on other nodes in the network?  I think the
definition of what constitutes an interdependent system is
application-defined.
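
To make the persisted-state question concrete, here is a minimal
sketch in Python (not from any real system).  It assumes state is
stored as JSON alongside a SHA-256 checksum; the names save_state and
load_state and the "balance must not be negative" invariant are
hypothetical, chosen only for illustration.  The point is that a
restart checks the state before trusting it, and reports whether it
had to fall back.

```python
import hashlib
import json


def save_state(state):
    """Serialize state together with a checksum so corruption is detectable."""
    payload = json.dumps(state, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return json.dumps({"payload": payload, "sha256": digest})


def load_state(blob, default):
    """Return (state, trusted); fall back to a copy of default if validation fails."""
    try:
        record = json.loads(blob)
        payload = record["payload"]
        actual = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if actual != record["sha256"]:
            return dict(default), False  # bytes changed since they were written
        state = json.loads(payload)
        # Hypothetical application-level invariant: a checksum only proves the
        # bytes are intact, not that the values written were sane.
        if not isinstance(state, dict) or state.get("balance", 0) < 0:
            return dict(default), False
        return state, True
    except (ValueError, KeyError, TypeError, AttributeError):
        return dict(default), False  # malformed record
```

The caller still has to decide what an untrusted result means.
Starting from defaults may be fine for a cache and unacceptable for
a ledger, which is the application-defined part.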

And yes, I know all about tiny bugs creating insane problems.
With event-based asynchronous programming, the most common
serious bugs I encounter are memory corruption problems from
dangling pointers, and the only way to find and fix these is by
analyzing gigabytes worth of log files to try to unpack what
happened after the fact.  Spending a day looking at the collateral
damage from what ultimately turns out to be a backwards
conditional expression in an error handler somewhere gives a
pretty healthy respect for the brittleness of memory-unsafe code.
This is one area where having a GC is an enormous win.

> Remember the Apollo 11 lunar landing, when the descent computer 
> software started showing self-detected faults? Armstrong turned 
> it off and landed manually. He wasn't going to bet his ass that 
> the faults could be ignored. You and I wouldn't, either.

And this is great if there's a human available to take over.  But
what if this were a space probe?

>> I think what I'm trying to say is that simply aborting on 
>> error is too brittle in some cases, because it only deals with 
>> one vector--memory corruption that is unlikely to reoccur.  
>> But I've watched always-on systems fall apart from some 
>> unexpected ongoing situation, and simply restarting doesn't 
>> actually help.
>
> In such a situation, ignoring the error seems hardly likely to 
> do any better.

Again, I'm not saying the error should be ignored, but rather that
a restart may not be the appropriate response to the problem.  Or
it may be a part of the appropriate response, but other things
need to happen as well.
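
As a rough illustration of "restart plus other things", here is a
small Python sketch.  The supervise function, its max_restarts
parameter, and the on_escalate callback are all hypothetical names,
not an API from any real framework.  It restarts a failing task a
bounded number of times, and if the failure persists it hands the
collected errors to an escalation hook instead of looping forever.

```python
def supervise(task, max_restarts=3, on_escalate=None):
    """Run task(attempt); restart on failure, escalate if failures persist."""
    failures = []
    for attempt in range(max_restarts + 1):
        try:
            return task(attempt)
        except Exception as exc:
            # Record the failure; an ongoing condition will show up here
            # as the same error repeated on every attempt.
            failures.append(exc)
    if on_escalate is not None:
        # Restarting did not help, so something else has to happen:
        # alert an operator, fence off the node, switch to a safe mode.
        on_escalate(failures)
    raise RuntimeError("gave up after %d failures" % len(failures))
```

A transient fault gets through on a later attempt, while an ongoing
one ends up in on_escalate with the full failure history attached.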