Program logic bugs vs input/environmental errors

Sat Oct 4 02:18:43 PDT 2014

On 10/3/2014 10:00 AM, Joseph Rushton Wakeling via Digitalmars-d wrote:
> What I'm asking you to consider is a use-case, one that I picked quite
> carefully.  Without assuming anything about how the system is architected, if we
> have a telephone exchange, and an Error occurs in the handling of a single call,
> it seems to me fairly unarguable that it's essential to avoid this bringing down
> everyone else's call with it.  That's not simply a matter of convenience -- it's
> a matter of safety, because those calls might include emergency calls, urgent
> business communications, or any number of other circumstances where dropping
> someone's call might have severe negative consequences.

What you're doing is attempting to write a program with the requirement that the 
program cannot fail.

It's impossible.

If that's your requirement, the system needs to be redesigned so that it can 
accommodate the failure of the program.

(Ignoring bugs in the program is not accommodating failure, it's pretending that 
the program cannot fail.)

> As I'm sure you realize, I also picked that particular use-case because it's one
> where there is a well-known technological solution -- Erlang -- which has as a
> key feature its ability to isolate different parts of the program, and to deal
> with errors by bringing down the local process where the error occurred, rather
> than the whole system.  This is an approach which is seriously battle-tested in
> production.

As I (and Brad) have stated before, process isolation, shutting down the failed 
process, and restarting the process, is acceptable, because processes are 
isolated from each other.
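
A minimal sketch of that arrangement, in Python for brevity rather than D. The worker script, the `supervise` function, and the restart limit are all hypothetical names chosen for illustration. The worker is made to fail on its first two attempts to stand in for a program bug. The supervisor never tries to repair or continue the failed worker; it only observes that the process died and starts a fresh one.

```python
import subprocess
import sys

# Hypothetical worker. It raises on its first two attempts to simulate a
# logic bug, then handles the call normally on the third.
WORKER_SOURCE = """
import sys
attempt = int(sys.argv[1])
if attempt < 2:
    raise AssertionError("simulated logic bug")
print("call handled")
"""


def supervise(max_restarts=5):
    """Run the worker in its own process and restart it whenever it dies."""
    for attempt in range(max_restarts + 1):
        result = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE, str(attempt)],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return attempt, result.stdout.strip()
        # The failed process is gone, together with any state it may have
        # corrupted. Nothing of it is reused; the loop starts a new process.
    raise RuntimeError("worker kept failing; escalate to a higher-level supervisor")


if __name__ == "__main__":
    print(supervise())
```

The supervisor stays small and does nothing except start, watch, and restart, so a bug in the call-handling code cannot reach its memory.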

Threads are not isolated from each other. They are not. Not. Not.
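
The difference can be sketched as follows, again in Python and with hypothetical names (`routing_table`, `buggy_handler`). Python is memory safe, so the corruption is simulated as an explicit erroneous write; in D or C a stray pointer could produce the same effect without the handler ever naming the other call's data. The same bug is run once in a thread and once in a separate process.

```python
import json
import subprocess
import sys
import threading

# Hypothetical state shared by all call handlers.
routing_table = {"call-1": "line-A", "call-2": "line-B"}


def buggy_handler(table):
    # Simulated bug: the handler for call-1 overwrites another call's entry.
    table["call-2"] = "GARBAGE"


def run_in_thread(table):
    # The thread receives a reference to the very same dict object.
    worker = threading.Thread(target=buggy_handler, args=(table,))
    worker.start()
    worker.join()
    return table["call-2"]


# The same bug, executed in a child process on its own private copy.
CHILD_SOURCE = """
import json, sys
table = json.loads(sys.argv[1])
table["call-2"] = "GARBAGE"
"""


def run_in_process(table):
    # The child gets a serialized copy; it has no access to the parent's memory.
    subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE, json.dumps(table)],
        check=True,
    )
    return table["call-2"]


if __name__ == "__main__":
    print("after process:", run_in_process(dict(routing_table)))
    print("after thread: ", run_in_thread(dict(routing_table)))
```

After the child process exits, the parent's entry for `call-2` is unchanged. After the thread finishes, it has been overwritten, and every other thread that reads the table sees the damage.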

> As I said, I'm not asking you to endorse catching Errors in threads, or other
> gross simplifications of Erlang's approach.  What I'm interested in are your
> thoughts on how we might approach resolving the requirement for this kind of
> stability and localization of error-handling with the tools that D provides.
>
> I don't mind if you say to me "That's your problem" (which it certainly is:-),
> but I'd like it to be clear that it _is_ a problem, and one that it's important
> for D to address, given its strong standing in the development of
> super-high-connectivity server applications.

The only way to have super high uptime is to design the system so that failure 
is isolated, and the failed process can be quickly restarted or replaced. 
Ignoring bugs is not isolation, and hoping that bugs in one thread don't 
affect memory shared by other threads doesn't work.