Non-null objects, the Null Object pattern, and T.init

H. S. Teoh hsteoh at quickfur.ath.cx
Fri Jan 17 19:05:50 PST 2014


On Sat, Jan 18, 2014 at 02:22:22AM +0000, digitalmars-d-bounces at puremagic.com wrote:
> On Saturday, 18 January 2014 at 01:46:55 UTC, Walter Bright wrote:
[...]
> >Consider also the Toyota. My understanding from reading reports
> >(admittedly journalists botch up the facts) is that a single
> >computer controls the brakes, engine, throttle, ignition switch,
> >etc. Oh joy. I wouldn't want to be in that car when it keeps on
> >going despite having self-detected faults.
> 
> So you would rather have the car drive off the road because the
> anti-skid software abruptly turned itself off during an emergency
> manoeuvre?
[...]

You missed his point. The complaint is that the car has a *single*
software system that handles everything. That's a single point of
failure. When that single software system fails, *everything* fails.

A fault-tolerant design demands at least two anti-skid software units,
where the redundant unit will kick in when the primary one turns off or
stops for whatever reason. So when a software fault occurs in the
primary unit, it gets shut off, and the backup unit takes over and keeps
the car stable.  You'd only crash in the event that *both* units fail at
the same time, which is far less likely than a single unit failing.
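
To put rough numbers on "far less likely", here is a back-of-envelope
sketch. The failure probabilities are made up for illustration, and the
function name is hypothetical. The calculation assumes the two units fail
independently, which only holds if they share no common cause (same bug,
same power supply, etc.).

```python
def p_both_fail(p_primary, p_backup):
    """Probability that both units fail in the same interval.

    Assumption: the two failures are statistically independent.
    A shared bug or a shared power supply breaks this assumption.
    """
    return p_primary * p_backup

# Illustrative figures only: each unit fails in 1 out of 1000 trips.
single = 0.001
both = p_both_fail(single, single)  # roughly one in a million trips
```

If the units are not independent, the real figure sits somewhere between
the product and the single-unit probability, which is why decoupling
matters.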

This is better than having a single software system that tries to fix
itself when it goes wrong, because the fact that something caused part
of the code to crash (segfault, or whatever) is a sign that the system
is no longer in a state anticipated by the engineers, so there's no
guarantee it won't make things worse when it tries to fix itself. For
example, it might be scrambled into a state where it keeps the
accelerator on with no way to override it, thereby making the problem
worse.

You need a *decoupled* redundant system to be truly confident that
whatever fault caused the problem in the first system doesn't also
affect the backup / self-repair system, something which doesn't hold for
a single software unit (for example, if the power supply to the unit
fails, then whatever self-repair subsystem it has will also be
non-functional). That way, when the first unit goes wrong, it can simply
be shut off so that it cannot make the problem worse, while the backup
unit takes over and keeps things going.

To use a software example: if you have a single process that tries to
fix itself when, say, a null pointer is dereferenced, then there's no
guarantee that the error recovery code won't do something stupid, like
format your disk. A null pointer in an unexpected place proves that the
code has logic problems: it isn't in a state that the engineers planned
for, so who knows what else is wrong with it. Maybe the same bug that
produced the null also replaced a function pointer for displaying
graphics with a pointer to the formatDisk function. If instead you have
two redundant
processes, one of which is doing the real work and the second is just
sleeping, then when the first process segfaults due to a null pointer,
the second one can kick into action -- since it hasn't been doing the
same thing as the first process, it's likely still in a safe, consistent
state, and so it can safely take over and keep the service running.
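
A minimal sketch of that arrangement, in Python rather than D for brevity.
Everything named here (PRIMARY, STANDBY, run_unit, serve) is hypothetical,
and the "crash" is simulated by calling a method on None, Python's nearest
analogue to dereferencing a null pointer. Each unit runs in its own OS
process, so the standby never sees the primary's corrupted state.

```python
import subprocess
import sys

# Hypothetical primary: crashes by using None where an object was expected.
PRIMARY = "state = None; state.step()"
# Hypothetical standby: a separate, fresh process that does the work.
STANDBY = "print('standby: serving request')"

def run_unit(source):
    """Run one unit in its own process; return (exit_code, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout.strip()

def serve():
    """Try the primary; on any abnormal exit, fall back to the standby."""
    code, out = run_unit(PRIMARY)
    if code == 0:
        return "primary", out
    # No attempt is made to repair the primary in place: once it has
    # crashed, its state is not trusted.
    code, out = run_unit(STANDBY)
    if code != 0:
        raise RuntimeError("both units failed")
    return "standby", out
```

The point of the design is that the recovery decision lives outside the
process that failed; the failed process is only ever observed through its
exit code.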

This is the premise of high-availability systems: there's a primary
server that's doing the real work, and one or more redundant units. When
the primary dies (power loss, CPU overheat, a segfault that leaves it
unresponsive, etc.), a watchdog timer triggers a failover to the
second unit, thus minimizing service interruption time. The failover
detection code can then notify an administrator (email, SMS, etc.)
that something went wrong with the first unit, and service continues
while the first unit is repaired.
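
The watchdog part can be sketched as follows. This is an illustration
under assumptions, not how any particular HA product does it: the
Watchdog class, its method names, and the 3-second timeout are all
invented here, and time is passed in explicitly so the logic is easy to
follow. A real deployment would run the check on separate hardware and
send an actual email or SMS instead of appending to a list.

```python
class Watchdog:
    """Fails over to the backup if the primary stops sending heartbeats."""

    def __init__(self, timeout, now):
        self.timeout = timeout      # max allowed silence, in seconds
        self.last_beat = now        # time of the most recent heartbeat
        self.active = "primary"     # which unit is currently serving
        self.alerts = []            # stand-in for email/SMS notifications

    def heartbeat(self, now):
        # Called by the primary while it is still healthy.
        self.last_beat = now

    def check(self, now):
        # Called periodically by the monitor, never by the primary itself.
        if self.active == "primary" and now - self.last_beat > self.timeout:
            self.active = "backup"
            self.alerts.append(
                "primary silent since t=%s; failed over at t=%s"
                % (self.last_beat, now)
            )
        return self.active

wd = Watchdog(timeout=3, now=0)
wd.heartbeat(now=2)
first = wd.check(now=4)    # 2 seconds of silence: still within the timeout
second = wd.check(now=6)   # 4 seconds of silence: failover is triggered
```

Note that the watchdog never asks the primary whether it is healthy; it
only notices the absence of heartbeats, so it still works when the
primary is too broken to report anything.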

OTOH, if you have only a single unit and something goes wrong, there's a
risk that the recovery code will go wrong too, so the entire unit stops
functioning, and service is interrupted until it's repaired.


T

-- 
Study gravitation, it's a field with a lot of potential.

