D - Unsafe and doomed

H. S. Teoh hsteoh at quickfur.ath.cx
Sun Jan 5 20:15:24 PST 2014


On Mon, Jan 06, 2014 at 02:24:09AM +0000, digitalmars-d-bounces at puremagic.com wrote:
> On Sunday, 5 January 2014 at 15:19:15 UTC, H. S. Teoh wrote:
> >Isn't that usually handled by running the webserver itself as a
> >separate process, so that when the child segfaults the parent returns
> >HTTP 501?
> 
> You can do that. The hard part is how to deal with the other 99
> non-offending concurrent requests running in the faulty process.

Since a null pointer implies that there's some kind of logic error in
the code, how much confidence do you have that the other 99 concurrent
requests aren't being wrongly processed too?


> How does the parent process know which request was the offending,
> and what if the parent process was the one failing, then you should
> handle it in the front-end-proxy anyway?

Usually the sysadmin would set things up so that if the front-end proxy
dies, it would be restarted by a script in (hopefully) a clean state.
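For concreteness, the restart-script idea can be sketched as a tiny supervisor loop. This is a minimal sketch, not anyone's actual setup: the `supervise` function and `max_restarts` parameter are hypothetical names, and `cmd` stands in for whatever server binary you run.

```python
import subprocess
import time

def supervise(cmd, max_restarts=None, backoff=0.1):
    """Run `cmd` and restart it each time it exits, so a crash (e.g. a
    segfault from a null-pointer dereference) resets the service to a
    known-good initial state.  Returns the exit codes observed; a
    negative code means the child died on a signal such as SIGSEGV.
    `max_restarts` bounds the loop for testing; a real init script
    would loop forever."""
    codes = []
    while max_restarts is None or len(codes) < max_restarts:
        proc = subprocess.run(cmd)
        codes.append(proc.returncode)
        time.sleep(backoff)  # back off briefly to avoid a tight crash loop
    return codes
```

In practice you'd let init, systemd, or a similar mechanism play this role rather than hand-rolling the loop.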


> Worse, cutting off all requests could leave trash around in the
> system where requests write to temporary data stores where it is
> undesirable to implement a full logging/cross-server transactional
> mechanism. That could be a DoS vector.

I've had to deal with this issue before at my work (it's not related to
webservers, but I think the same principle applies). There's a daemon
that has to run an operation to clean up a bunch of auxiliary data after
the user initiates the removal of certain database objects. The problem
is, some of the cleanup operations are non-trivial and can fail (the
failure could be an error returned from deep within the cleanup code, a
segfault, or whatever).  So I wrote some complex scaffolding code to
catch these kinds of problems and to try to clean things up afterwards.

But eventually we found that attempting this sort of error recovery was
actually counterproductive, because it made the code more complicated
and added intermediate states: in addition to "object present" and
"object deleted", there was now "object partially deleted", and all code
had to detect this state and decide what to do with it.  Then customers
started seeing the "object partially deleted" state, which was never
part of the design of the system, and that led to all sorts of odd
behaviour (certain operations didn't work, the object showed up in some
places but not others, etc.). Finally, we decided that it's better to
keep the system in simple, well-defined states (only "object present"
and "object not present"), even if that comes at the cost of leaving
stray unreferenced data lying around from a previous failed cleanup
operation.
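A minimal sketch of that two-state design (the function name and the dict-backed store are hypothetical, not the actual system): deletion first flips the object to "not present", then makes a best-effort pass over the auxiliary data, accepting stray leftovers rather than a "partially deleted" state.

```python
def delete_object(db, obj_id):
    """Remove `obj_id` so the system only ever sees 'present' or
    'not present'.  `db` is a hypothetical store mapping
    id -> (record, aux_data).  Auxiliary cleanup is best-effort: if it
    fails, stray unreferenced data is left behind for a later sweep
    instead of exposing a 'partially deleted' state to other code."""
    record, aux = db.pop(obj_id)   # after this, the object is "not present"
    try:
        aux.clear()                # best-effort auxiliary cleanup
    except Exception:
        pass                       # tolerate stray data; never re-add the object
```

The key design choice is the ordering: the authoritative state changes first, so a cleanup failure can never be observed as a half-deleted object.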

Based on this, I'm inclined to say that if a web request process
encountered a NULL pointer, it's probably better to just reset back to a
known-good state by restarting. Sure, it leaves a bunch of stray data
around, but reducing code complexity often outweighs saving wasted
space.


> >HTTP link? I rather the process segfault immediately rather than
> >continuing to run when it detected an obvious logic problem with
> >its own code).
> 
> And not start up again, keeping the service down until a bugfix
> arrives?

No, usually you'd set things up so that if the webserver goes down, an
init script would restart it. Restarting is preferable, because it
resets the program back to a known-good state. Continuing to barge on
when something has obviously gone wrong (null pointer where it's not
expected) is risky, because what if that null pointer is not due to a
careless bug, but a symptom of somebody attempting to inject a root
exploit?  Blindly continuing will only play into the hands of the
attacker.


> A null pointer error can be an innocent bug for some services, so I
> don't think the programming language should dictate what you do,
> though you probably should have write protected code-pages with
> execute flag.

The thing is, a null pointer error isn't just an exceptional condition
caused by bad user data; it's a *logic* error in the code. It's a sign
that something is wrong with the program logic. I don't consider that an
"innocent error"; it's a sign that the code can no longer be trusted to
do the right thing. So, I'd say it's safer to terminate the
program and have the restart script reset the program state back to a
known-good initial state.


> E.g. I don't think it makes sense to shut down a trivial service
> written in "Python" if it has a logic flaw that tries to access a
> None pointer for a specific request if you know where in the code it
> happens. It makes sense to issue an exception, catch it in the
> request handler free all temporary allocated resources and tell the
> offending client not to do that again and keep the process running
> completing all other requests. Otherwise you have a DoS vector?

Tell the client not to do that again? *That* sounds like the formula for
a DoS vector (a rogue client deliberately sending the crashing request
over and over again).
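For concreteness, the per-request catch-all being proposed looks something like this sketch (the handler table and function names are hypothetical); the comment marks exactly the concern above:

```python
def handle_request(request, handlers):
    """Per-request catch-all, as in the quoted proposal: any exception
    in a handler (including an AttributeError from an unexpected None)
    is caught, and the client gets a 500 while the process keeps
    serving other requests.  `handlers` maps path -> function."""
    try:
        return 200, handlers[request["path"]](request)
    except Exception:
        # The process survives -- but nothing stops a rogue client from
        # sending the same crashing request in a loop, and any shared
        # state the failed handler touched may already be corrupted.
        return 500, "internal error"
```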


> It should be up to the application programmer whether the program
> should recover and complete the other 99 concurrent requests before
> resetting, not the language. If one http request can shut down the
> other 99 requests in the process then it becomes a DoS vector.

I agree with the principle that the programmer should decide what
happens, but I think there's a wrong assumption here that the *program*
is fit to make this decision after encountering a logic error like an
unexpected null pointer. Again, it's not a case of bad user input, where
the problem lies only with the data and you can simply throw away the
bad data and start over. This is a case of a problem with the *code*,
which means you cannot trust that the program will continue doing what
you designed
it to -- the null pointer proves that the program state *isn't* what you
assumed it is, so now you can no longer trust that any subsequent code
will actually do what you think it should do.

This kind of misplaced assumption is the underlying basis for things
like stack corruption exploits: under normal circumstances your function
call will simply return to its caller after it finishes, but now, it
actually *doesn't* return to the caller. There's no way you can predict
where it will go, because the fundamental assumptions about how the
stack works no longer hold due to the corruption. Blindly assuming that
things will still work the way you think they do will only lead to your
program running the exploit code that has been injected into the
corrupted stack.

The safest recourse is to reset the program back to a known state.


T

-- 
People say I'm arrogant, and I'm proud of it.
