Program logic bugs vs input/environmental errors

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Fri Oct 31 14:04:48 PDT 2014


On Fri, Oct 31, 2014 at 08:23:04PM +0000, Kagamin via Digitalmars-d wrote:
> On Friday, 24 October 2014 at 18:47:59 UTC, H. S. Teoh via Digitalmars-d
> wrote:
> >Basically, if you want a component to recover from a serious problem
> >like a failed assertion, the recovery code should be in a *separate*
> >component. Otherwise, if the recovery code is within the failing
> >component, you have no way to know if the recovery code itself has
> >been compromised, and trusting that it will do the right thing is
> >very dangerous (and is what often leads to nasty security exploits).
> >The watcher must be separate from the watched, otherwise how can you
> >trust the watcher?
> 
> You make process isolation sound like a silver bullet, but failure
> can happen at any scale, from a temporary variable to the global
> network. You can't use process isolation to contain a failure larger
> than process scale, and it's overkill for a failure at
> temporary-variable scale.

You're missing the point. The point is that a system made of
unreliable parts can only be reliable if you have multiple *redundant*
copies of each component that are *decoupled* from each other.

The usual unit of isolation at the lowest level is the single process,
because threads within a process have full access to memory shared by
all threads. They are therefore not decoupled from each other, and you
cannot put any confidence in the correct functioning of the other
threads once a single thread has become inconsistent. The only
failsafe solution is to have multiple redundant processes, so that
when one process becomes inconsistent, you fail over to another,
*decoupled* process that is known to be good.
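
To make this concrete, here's a minimal supervisor sketch in D (the
./worker binary is hypothetical); the watcher lives in its own
process, so nothing the worker does to its own memory can reach it:

import std.process : spawnProcess, wait;
import std.stdio : stderr;
import core.thread : Thread;
import core.time : seconds;

void main()
{
    // The supervisor shares no memory with the worker: if the worker
    // dies (failed assert, crash, kill), discard it entirely and
    // start a fresh, known-good instance rather than patching it up.
    while (true)
    {
        auto pid = spawnProcess(["./worker"]); // hypothetical binary
        immutable status = wait(pid);          // block until it exits
        if (status == 0)
            break;                             // clean shutdown
        stderr.writeln("worker exited with ", status, "; respawning");
        Thread.sleep(1.seconds);               // naive restart backoff
    }
}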

This does not mean that process isolation is a "silver bullet" -- I
never said any such thing. The same reasoning applies to larger
components in the system as well. If you have a server that performs
function X, and the server begins to malfunction, you cannot expect
the server to fix itself -- because you cannot rule out that a hacker
has rooted the server and is running exploit code in place of your
application. The only 100% safe way to recover is to have another
redundant server (or more) that also performs function X: shut down
the malfunctioning server for investigation and repair, and in the
meantime switch over to the redundant server to continue operations.
You don't shut down the *entire* network unless all redundant
components have failed.

The reason you cannot go below the process level as the unit of
redundancy is coupling. The above design of failing over to a
redundant module only works if the modules are completely decoupled
from each other. Otherwise, you end up with a situation where you have
two redundant modules M1 and M2, but both of them share a common
helper module M3. Then if M1 detects a problem, you cannot be 100%
sure it wasn't caused by a problem in M3, so if you just switch to M2,
it may fail in exactly the same way. Similarly, you cannot guarantee
that M1, while malfunctioning, hasn't somehow damaged M3, thereby
making M2 unreliable as well. The only way to be 100% sure that
failover will actually fix the problem is to make sure that M1 and M2
are completely isolated from each other (e.g., by having two redundant
copies of M3 that are isolated from each other).
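
As a toy model of the coupling problem (all in one process, so it only
shows the shape of the argument, not a real deployment):

struct Helper { int[] cache; }     // stands in for M3
struct Worker { Helper* helper; }  // stands in for M1 / M2

void main()
{
    // Coupled: both workers point at the same helper, so if M1
    // scribbles over sharedHelper.cache, failing over to M2 gains
    // nothing -- it reads the same damaged state.
    auto sharedHelper = new Helper;
    auto m1 = Worker(sharedHelper);
    auto m2 = Worker(sharedHelper);

    // Decoupled: each worker owns a private copy of the helper, so
    // damage done by M1 cannot reach M2's copy, and failover actually
    // means something.
    auto m1d = Worker(new Helper);
    auto m2d = Worker(new Helper);
}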

Since a single process is the unit of isolation in the OS, you can't
go below this granularity: as I've already said, if one thread is
malfunctioning, it may have trashed the data shared by all the other
threads in the same process, and therefore none of the other threads
can be trusted to continue operating correctly. The only way to be
100% sure that failover will actually fix the problem is to switch
over to another process that you *know* is not coupled to the old,
malfunctioning one.

Attempting to have a process "fix itself" after detecting an
inconsistency is unreliable -- you're leaving it up to chance whether
the attempted recovery will actually work rather than make the problem
worse. You cannot guarantee the recovery code itself hasn't been
compromised by the failure, because the recovery code lives in the
same process: it is vulnerable to the same problem that caused the
original failure, and to any memory corruption caused by the
malfunctioning code before the problem was detected. Therefore the
recovery code is not trustworthy, and cannot be relied on to continue
operating correctly. That kind of "maybe, maybe not" recovery is not
what I'd want to put any trust in, especially when it comes to
critical applications that can cost lives if things go wrong.
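
In D terms: a failed assert throws AssertError, which is an Error, not
an Exception, and "recovering" from it in-process looks like this toy
(the table and bad index are made up for illustration):

import core.exception : AssertError;
import std.stdio : stderr, writeln;

int[] table = [1, 2, 3];

void compute(size_t i)
{
    assert(i < table.length, "index invariant violated");
    writeln("value: ", table[i]);
}

void main()
{
    try
        compute(99);
    catch (AssertError e)
    {
        // This handler runs inside the very process whose state just
        // proved untrustworthy: nothing here can verify that table,
        // the stack, or the heap are still sane. (When an Error
        // propagates, D doesn't even guarantee that destructors and
        // scope guards ran on the way up.)
        stderr.writeln("caught: ", e.msg, " -- now what?");
    }
}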


T

-- 
English has the lovely word "defenestrate", meaning "to execute by
throwing someone out a window", or more recently "to remove Windows from
a computer and replace it with something useful". :-) -- John Cowan

