Unexpected Pod Restarts Caused by Unresponsive Node.js Process (MCP / OAuth) #12078
jannickHo asked this question in Troubleshooting (unanswered)
Observed Behavior
Our LibreChat instance running in Kubernetes has been restarting repeatedly: 2 times on the day of reporting and 7 times the day before. Concurrent user load at the time was approximately 10 users.
The restarts were forced by Kubernetes: the liveness probe (10s timeout, 5 retries) detected that the health endpoint had stopped responding. Graceful shutdown after the SIGTERM did not complete, so Kubernetes escalated to a hard SIGKILL. This indicates the Node.js process had either crashed or become fully unresponsive before the SIGTERM was sent.
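For reference, a liveness probe matching these parameters would look roughly like this in the Deployment spec (the endpoint path and port are assumptions for illustration, not copied from our manifests):

```yaml
livenessProbe:
  httpGet:
    path: /health      # assumed health endpoint path
    port: 3080         # assumed LibreChat port
  timeoutSeconds: 10   # the "10s timeout" above
  failureThreshold: 5  # the "5 retries" above
  periodSeconds: 10
```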
In 3 out of 4 restarts, the last log line before the restart was an MCP tool call timeout:
The MCP servers in use are OAuth-authenticated and use HTTP streaming connections with a 30-second server-side timeout. This causes frequent reconnects — roughly every 30 seconds per connected user.
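Reconnect cycles like this are typically driven by timers or stream-close handlers that invoke async functions fire-and-forget. A minimal sketch (all names hypothetical, not actual LibreChat code) of how a rejection from such a call escapes every try/catch, and how a global handler keeps the process alive on Node.js 15+:

```javascript
// Global safety net: without this, Node.js 15+ terminates the process
// when an unhandled rejection surfaces.
let caught = null;
process.on('unhandledRejection', (reason) => {
  caught = reason;
  console.error('unhandled rejection:', reason);
});

// Hypothetical reconnect helper that fails, e.g. during an OAuth token refresh.
async function reconnect() {
  throw new Error('token refresh failed');
}

// Fire-and-forget call site: no await, no .catch(), so the rejection
// propagates to the global 'unhandledRejection' event instead.
function onStreamClosed() {
  reconnect();
}

onStreamClosed();

// Runs on the next timer tick, after the rejection has been delivered.
setTimeout(() => {
  console.log('process still alive, caught:', caught && caught.message);
}, 10);
```

With the handler registered the process survives one bad reconnect; without it, the same call site takes down the whole pod.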
Environment:
- Container image: node:20-alpine

Claude Analysis
Node.js 15+ terminates the process by default when a Promise rejection goes unhandled. The codebase registers an `uncaughtException` handler but no `unhandledRejection` handler, leaving this default behavior in place.

Several places in the MCP and OAuth reconnection code appear to call `async` functions in a fire-and-forget pattern: without `await` and without `.catch()`. If any of those async operations throw internally, the resulting rejection has no handler. Given that these code paths are triggered exactly during MCP timeouts and OAuth reconnect cycles (which happen frequently with 30-second HTTP streaming connections), it is plausible that an unhandled rejection from one of these calls is what causes the process to stop responding and eventually be killed by the liveness probe.

Reproduction Update
We were able to reproduce the crash with the following configurations (MCP server itself unchanged throughout):