Make notification services resilient to NATS outages #2448

@jbair06

Description:
api/chain/notification services should not crash or become unavailable when NATS is down or unreachable. They must keep functioning, and a backup fan-out mechanism (e.g., TCP-based or similar) should be researched and implemented so that notifications are still delivered while NATS is unavailable.

Requirements:

  • Update NATS client usage so that:
    • Service startup does not crash if NATS cannot be reached.
    • Connection loss triggers graceful degradation instead of process exit (e.g., stop publishing to NATS, but keep HTTP/API endpoints up).
  • Implement reconnection and health-handling logic:
    • Use automatic reconnection with backoff and unlimited retries where appropriate.
    • Log NATS connection errors and status changes clearly for operations.
  • Research and design a backup fan-out mechanism for use when NATS is unavailable:
    • Options could include direct TCP connections, a database-backed outbox pattern, or another message transport that can be enabled as a fallback.
    • Define how messages are queued during the outage and delivered once NATS returns.
  • Define behavior for fan-out when NATS is down:
    • How to avoid message loss.
    • How to avoid duplicate delivery when NATS comes back and both paths might send.

Acceptance criteria:

  • api/chain/notification services remain up and responsive when NATS is unavailable (no crashes or failed container restarts).
  • NATS outages are handled via reconnection logic and clear logs, without blocking core APIs.
  • A documented backup fan-out design is agreed upon and an initial implementation is in place (or a concrete plan exists if phased).
  • Fan-out behavior is verified under simulated NATS outage (e.g., stopping the NATS cluster) and recovery, with no data loss and no service crashes.
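
The reconnection and simulated-outage criteria above can be exercised without a real NATS cluster. The following is a rough sketch under stated assumptions: `backoff_delays` and `connect_with_retry` are hypothetical helpers, and `connect` stands in for the real client's connect call. NATS client libraries typically provide their own reconnect settings, which would normally be preferred; this only illustrates the intended behaviour (capped exponential backoff, unlimited retries, a log line per failure) in a form that is easy to unit test.

```python
import logging
import random
import time
from typing import Callable, Iterator

log = logging.getLogger("nats-supervisor")


def backoff_delays(base: float = 0.5, cap: float = 30.0) -> Iterator[float]:
    """Yield exponentially growing delays, capped at `cap`, forever."""
    delay = base
    while True:
        yield delay
        delay = min(cap, delay * 2)


def connect_with_retry(
    connect: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
    base: float = 0.5,
    cap: float = 30.0,
) -> int:
    """Call `connect` until it succeeds; return the number of attempts made.

    Intended to run in a background thread or task so that HTTP/API
    endpoints stay responsive while NATS is unreachable.
    """
    for attempt, delay in enumerate(backoff_delays(base, cap), start=1):
        try:
            connect()
            log.info("NATS connected after %d attempt(s)", attempt)
            return attempt
        except OSError as exc:  # includes ConnectionRefusedError, timeouts
            log.warning(
                "NATS connect attempt %d failed: %s; retrying in up to %.1fs",
                attempt, exc, delay,
            )
            # Jitter spreads out retries from many service instances.
            sleep(delay * jitter())
    raise AssertionError("unreachable: backoff_delays is infinite")
```

In a test, `sleep` and `jitter` can be replaced with fakes so a simulated outage of several failed attempts runs instantly and the exact delay sequence can be asserted.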

Metadata

Labels

Backend; Feature Enhancement: Enhancing an existing feature driven by business requirements. Typically backwards compatible.

Projects

Status

👀 In review
