The extent of trust in errors and error handling

Sun Feb 5 00:55:40 PST 2017

On Wednesday, 1 February 2017 at 19:25:07 UTC, Ali Çehreli wrote:
> tl;dr - Seeking thoughts on trusting a system that allows 
> "handling" errors.
>
> One of my extra-curricular interests is the Mill CPU[1]. A 
> recent discussion in that context reminded me of the 
> Error-Exception distinction in languages like D.
>
> 1) There is the well-known issue of whether Error should ever 
> be caught. If Error represents conditions where the application 
> is not in a defined state, hence it should stop operating as 
> soon as possible, should that also carry over to other 
> applications, to the OS, and perhaps even to other systems in 
> the whole cluster?
>

No, because your logic would then extend to all of the human 
race, to animals, etc. It is not practical and not necessary.

1. The ball must keep rolling. All of this stuff we do is fantasy
anyway, so if an error occurs in that Lemmings game, it is just a
game. It might take down every computer in the universe (if we
went with the logic above), but it can't affect humans because
they are distinct from computers (it might kill a few humans, but
that has always been acceptable to humans).

That is, it is not practical to take everything down, because an
error is not that serious and ultimately has limited effect.

That is, in the practical world, we are ok with some errors. This
allows us not to worry too much. The more we had to worry about
such errors, the more things would have to be shut down, exactly
because of the logic you have given. So the question is not
"should we do x or not x" but how much of x is acceptable.

(The human race has decided that quite a few errors are ok. We
can even have a medical device malfunction because of something
like an invalid array access and kill people, and it's ok: it's
just money, and the lawyers will be happy.)

2. Not all errors will systematically propagate into all other
systems, e.g., two computers that are not connected in any way. If
one has an error, the other won't be affected, so there is no
reason to take that computer down too.

So what matters, like anything else, is that we try to do the
best we can. We don't have to pick an arbitrary point at which to
stop, because we actually don't know where it is. What we do is
use reason and experience to decide what the most likely solution
is and see how much risk it carries. If it carries too much, we
back off; if it carries very little, we can ease off on the
effort spent guarding against it.

There is an optimal point, more or less, because risk requires
energy to manage (even for no risk).

Basically, if you assume, like you seem to be doing, that a
single error creates an unstable state in the whole system at
every point, then you are screwed from the get-go if you do not
want any unstable state at any cost. The only solution then is to
not have any errors at any point (which requires perfection,
something humans gave up on trying to achieve a long time ago).

3. Things are not so cut and dried. Intelligence can be used to
understand the problem. Not all errors are the same. Some errors
are catastrophic and need everything shut down, and some don't.
Knowing those error types is important. Hence, the more
descriptive an error is the better, as it allows one to separate
the cases. Also, designing things to be robust is another way to
mitigate the problems.
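
To make the "knowing those error types" point concrete, here is a
minimal sketch in Python (not D, and not anything from the
original post). The names RecoverableError, FatalError,
parse_reading and process are hypothetical, invented for this
illustration. The assumption is that a malformed string is
something we can skip, while a non-string input means a bug
upstream and the run should stop.

```python
class RecoverableError(Exception):
    """Hypothetical: a failure we can skip past and keep going."""


class FatalError(Exception):
    """Hypothetical: program state can no longer be trusted."""


def parse_reading(raw):
    # Assumption for this sketch: the data source promises strings.
    # Anything else means a bug upstream, so treat it as fatal.
    if not isinstance(raw, str):
        raise FatalError(f"expected str, got {type(raw).__name__}")
    try:
        return float(raw)
    except ValueError as exc:
        # A malformed string is bad input, not a broken program.
        raise RecoverableError(f"bad reading: {raw!r}") from exc


def process(readings):
    good, skipped = [], 0
    for raw in readings:
        try:
            good.append(parse_reading(raw))
        except RecoverableError:
            # Limited effect: drop this one reading and carry on.
            skipped += 1
        # FatalError is deliberately not caught here, so it
        # propagates and stops the whole run.
    return good, skipped
```

With this split, process(["1.5", "oops", "2"]) keeps the two valid
readings and counts one skipped, while a None in the input raises
FatalError to the caller. How much goes into each bucket is the
"how much of x is acceptable" decision from above.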

Programming is not much different from banking. You have a
certain amount of risk in a certain portfolio (program), you hedge
your bets (create a good, robust design), and hope for the best.
It's up to the individual to decide how much hedging is required,
as it will take time/money to do it.

Example: Windows. Obviously Windows was a design that didn't care
too much about robustness. Just enough to get the job done was
their motto. If someone dies because of some BSOD, it's not that
big a deal... it will be hard to trace the cause, and if it can
be traced, they have enough money to afford it. (Similar to the
Ford fiasco:
https://en.wikibooks.org/wiki/Professionalism/The_Ford_Pinto_Gas_Tank_Controversy)