Uncaught Exception Handling #32

rowanmanning · 2022-05-30T19:41:21Z

Maintainer

Note
Discussion has moved to the GitHub issue here, which proposes a solution.

I've been thinking a lot about uncaught exceptions and doing a few experiments to better understand how they work in Node.js. I think we're going to have to do more work than I initially thought to get uncaught exceptions logged consistently, but I think it's worth doing.

What's the problem we're trying to solve?

When an exception is thrown and not caught or when a promise is rejected without being handled, a Node.js process will crash. When Node.js crashes we want to know exactly why, and we want the details of the crash to be consistent with the way we log errors.

We want to be able to search Splunk for errors which caused a crash in the same way we search for those which resulted in a 500 error page.

How do we solve it?

On the surface this should be relatively easy to solve (especially now we have consistent error logging). Node.js provides two events on process named uncaughtException and unhandledRejection. These allow you to execute some code between the error occurring and the app crashing:

const { logUnhandledError } = require('@dotcom-reliability-kit/log-error');

function uncaughtErrorHandler(error) {
    logUnhandledError({ error });
    process.exit(process.exitCode || 1);
}

process.on('uncaughtException', uncaughtErrorHandler);
process.on('unhandledRejection', uncaughtErrorHandler);
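
One detail to keep in mind with a shared handler like this: Node.js passes the rejection reason to 'unhandledRejection' listeners, and that reason can be any value, not only an Error. A minimal sketch of normalising it before logging follows. The toError helper is hypothetical and not part of any Reliability Kit package:

```javascript
const { inspect } = require('node:util');

// Hypothetical helper: not part of @dotcom-reliability-kit/log-error.
// Converts whatever was thrown or rejected into an Error instance so that
// the logger always receives something with a message and a stack.
function toError(reason) {
    if (reason instanceof Error) {
        return reason;
    }
    // inspect() copes with undefined, symbols, and circular objects
    const error = new Error(`Non-error value thrown or rejected: ${inspect(reason)}`);
    error.originalReason = reason;
    return error;
}

// Possible usage with the handler above (left as a comment so that this
// sketch registers nothing on the real process):
// process.on('unhandledRejection', (reason) => uncaughtErrorHandler(toError(reason)));
```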

This could be nicely extracted into a library, so that each app can register these handlers by choice and potentially extend the behaviour to perform their own tear-down:

// Not really a thing yet
const uncaughtExceptionHandler = require('@dotcom-reliability-kit/uncaught-exception-handler');

process.on('uncaughtException', uncaughtExceptionHandler);
process.on('unhandledRejection', uncaughtExceptionHandler);

However there are two issues with the above approach.

Issue 1: async logging (solved)

The first issue is that, in production, we currently log everything asynchronously to Splunk via our splunkHEC transport. The Node.js documentation is clear about the correct use of these handlers:

The correct use of 'uncaughtException' is to perform synchronous cleanup of allocated resources (e.g. file descriptors, handles, etc) before shutting down the process. It is not safe to resume normal operation after 'uncaughtException'.

We can't guarantee that logs have been sent before we exit the process in our handler. Winston does not allow you to hook in and run code once you're certain that a log has been sent, and so we'd have to add in a nasty (and unreliable) hack like this:

function uncaughtErrorHandler(error) {
    // logUnhandledError goes here
    setTimeout(() => {
        process.exit(process.exitCode || 1);
    }, 200); // or whatever number of milliseconds we decide to allow
}

I think the correct solution to this problem is to ditch our custom Splunk logger (as is the plan anyway) and log to stdout, using Heroku log drains to forward our logs to Splunk.

We could accept a dirty hack like the one I outline above if we know that it's temporary and will be removed once we switch to Heroku log drains. However, we know hacks can stick around for a lot longer than we intend. Since writing this we have begun to migrate to Heroku log drains, so this issue goes away in the near future.
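
To illustrate why logging to stdout helps here, this is a rough sketch of writing the final log line synchronously before exiting. The log shape and the function names are assumptions for illustration, not the Reliability Kit format, and the behaviour of synchronous writes when stdout is a pipe (as it may be on Heroku) has not been verified:

```javascript
const fs = require('node:fs');

// Hypothetical log shape: the real Reliability Kit format may differ.
function formatCrashLog(error) {
    return JSON.stringify({
        level: 'error',
        event: 'UNHANDLED_ERROR',
        error: {
            name: error.name,
            message: error.message,
            stack: error.stack
        }
    });
}

// fs.writeSync performs the write before returning, so no log data is
// left queued in the event loop when the handler goes on to exit.
// File descriptor 1 is stdout.
function writeCrashLogSync(error, fd = 1) {
    fs.writeSync(fd, `${formatCrashLog(error)}\n`);
}
```

A handler built on this could call writeCrashLogSync(error) and then process.exit() straight away, with no timer in between.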

Issue 2: n-raven (part-solved)

The second issue is that, in production, we automatically register a global uncaughtException handler via the Sentry install method here. This is included as a non-optional part of n-express, which is used by all of our apps.

This means that regardless of whether we solve issue 1, it's always possible that Sentry beats us to writing logs and exits the process before our uncaught exception handler has a chance to.
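
The ordering problem can be shown with a plain EventEmitter, since process is one: listeners run in the order they were registered unless one is added with prependListener. The sketch below uses a stand-in emitter and a made-up 'sentry' listener purely for illustration. It is not how n-raven registers its handler, and registerFirst is a hypothetical helper:

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical helper. `processObject` defaults to the real `process`,
// but accepting it as a parameter makes the behaviour easy to try out.
function registerFirst(handler, processObject = process) {
    const existing = processObject.listenerCount('uncaughtException');
    // prependListener adds the handler to the front of the listener list,
    // so it runs before any handler that was registered earlier.
    processObject.prependListener('uncaughtException', handler);
    return existing;
}

// Demonstration with a stand-in emitter rather than the real process
const fakeProcess = new EventEmitter();
const calls = [];
fakeProcess.on('uncaughtException', () => calls.push('sentry'));
const alreadyRegistered = registerFirst(() => calls.push('ours'), fakeProcess);
fakeProcess.emit('uncaughtException', new Error('boom'));
// calls is now ['ours', 'sentry'] and alreadyRegistered is 1
```

This only moves our handler ahead of listeners that exist at the time we register, so it is a workaround rather than a fix for a dependency modifying process.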

My opinion is that importing n-express or n-raven should not make modifications to a global object like process and that we should remove this behaviour. I think it's better if the application (and our template) gets to make a decision about which uncaught exception handlers we register.

Getting to this point is more work. I think, as we need to switch to the new Sentry client anyway (see note here), that we should use this opportunity to deprecate or overhaul n-raven and switch apps over to the new Sentry Node.js Platform which does a lot less magic.

We'd then be able to roll Sentry logging into our own uncaught exception handler alongside our own Splunk logging format instead of magically registering it all as part of n-express. We could also add Sentry logging into our Express error logging middleware, further reducing the amount of code in n-express which probably should have never been there.

We're going to look into deprecating n-raven and providing an alternative in Reliability Kit. One possible approach is documented here; there will be updates once we've finished investigating how much our engineers use Sentry.

Feedback

I'd like to hear if you have alternative approaches to suggest, or if you know something about our uncaught exception handling that I don't.

JSRedondo · 2022-05-31T07:53:07Z


Great initiative!
Regarding the Winston logging, I think a hook can actually be implemented:

function waitForLogger(logger) {
    return new Promise((resolve) => {
        logger.logger.on('close', resolve);
        logger.logger.close();
    });
}

async function uncaughtErrorHandler(error) {
    logger.error({
        error: serializeError(error)
    });
    await waitForLogger(logger);
    process.exit(process.exitCode || 1);
}

There is an example in the Winston repository which uses the on('finish') event: finish-events.js
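
A minimal sketch of the 'finish' approach, with a timeout as a fallback so that a stuck logger can't prevent the process from exiting. It assumes the logger behaves like a Node.js writable stream, which Winston 3 loggers do. The endAndWait name and the slow stand-in logger are hypothetical, and whether 'finish' on the logger guarantees that every transport has flushed would still need to be verified:

```javascript
const { Writable } = require('node:stream');

// Hypothetical helper: ends a writable stream and resolves once it has
// emitted 'finish', or after `timeoutMs` if the stream never finishes.
function endAndWait(stream, timeoutMs = 200) {
    return new Promise((resolve) => {
        const timer = setTimeout(() => resolve('timeout'), timeoutMs);
        stream.once('finish', () => {
            clearTimeout(timer);
            resolve('finished');
        });
        stream.end();
    });
}

// Stand-in for a logger: a writable stream that takes 20ms per write
const slowLogger = new Writable({
    write(chunk, encoding, callback) {
        setTimeout(callback, 20);
    }
});
slowLogger.write('final log line\n');
const flushed = endAndWait(slowLogger);
```

In an uncaught exception handler, the process.exit() call would go after awaiting the returned promise, whichever way it resolves.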


JSRedondo · 2022-05-31T07:59:55Z


Regarding Sentry, the official implementation is a bit different from ours.
I think we have room for improvement here.

Raven.install(function(err, sendErr, eventId) {
  if (!sendErr) {
    console.log(
      "Successfully sent fatal error with eventId " + eventId + " to Sentry:"
    );
    console.error(err.stack);
  }
  console.log("This is thy sheath; there rust, and let me die.");
  process.exit(1);
});


rowanmanning · 2022-07-28T10:09:30Z

Maintainer Author

@kavanagh suggested something which is worth capturing. When it comes to the API for our error handling, we may want to integrate some more complex/magical things like Sentry in future. This would be difficult if we tied ourselves to the suggested API of exposing a handler and requiring apps to bind it to the process events themselves, e.g.

const uncaughtExceptionHandler = require('@dotcom-reliability-kit/uncaught-exception-handler');
process.on('uncaughtException', uncaughtExceptionHandler);
process.on('unhandledRejection', uncaughtExceptionHandler);

It's also error-prone and a little boilerplatey.

It might be more sensible if the API is something like this:

const uncaughtExceptionHandler = require('@dotcom-reliability-kit/uncaught-exception-handler');

uncaughtExceptionHandler.bind({
    useSentry: true, // opt into or out of Sentry
    exitCode: 1, // just an example of other configurable stuff
    onError: error => {
        // allow the app to inject extra behaviour when an uncaught exception occurs
    }
});
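
As a rough sketch of how such a bind function might work internally: the exitCode and onError options come from the proposal above, while log, exit, and processObject are extra injection points assumed here so that the example can run without touching the real process. useSentry is left out because it depends on how the Sentry integration ends up looking. None of this is an existing Reliability Kit API:

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical implementation of the proposed API, for illustration only.
function bind({
    exitCode = 1,
    onError = () => {},
    log = (error) => console.error(error),
    exit = (code) => process.exit(code),
    processObject = process
} = {}) {
    function handler(error) {
        try {
            log(error);
            onError(error);
        } finally {
            // Always exit, even if logging or the app's own hook throws
            exit(exitCode);
        }
    }
    processObject.on('uncaughtException', handler);
    processObject.on('unhandledRejection', handler);
    return handler;
}

// Demonstration with stand-ins for the process and for process.exit
const fakeProcess = new EventEmitter();
const seen = [];
bind({
    exitCode: 3,
    onError: (error) => seen.push(`hook:${error.message}`),
    log: (error) => seen.push(`log:${error.message}`),
    exit: (code) => seen.push(`exit:${code}`),
    processObject: fakeProcess
});
fakeProcess.emit('uncaughtException', new Error('boom'));
// seen is now ['log:boom', 'hook:boom', 'exit:3']
```

Keeping the registration inside the library is what would let us change the internals later (for example adding Sentry) without every app having to update its own process.on calls.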

