
Nats consumer retry and replicas #9975

Open
NSTA2 wants to merge 2 commits into dotnet:main from NSTA2:fix/nats-consumer-retry-and-replicas

Conversation

@NSTA2 NSTA2 commented Mar 26, 2026

NATS streaming consumer permanent failure after transient JetStream errors

Problem

When running Orleans against a multi-node NATS JetStream cluster, a rolling restart of NATS nodes (routine maintenance, crash recovery, scaling) causes permanent consumer failure with no recovery path short of restarting the Orleans silo.

Two issues combine to produce this:

1. No retry on consumer initialization

NatsStreamConsumer.Initialize() is called exactly once during startup. If it fails due to a transient error — timeout during JetStream leader election, network blip, temporary unavailability — the internal _consumer field stays null permanently. Every subsequent GetMessages() poll (~100ms) logs at Error level:

Internal NATS Consumer is not initialized. Provider: {Provider} | Stream: {Stream} | Partition: {Partition}.

…and returns empty, indefinitely, with no self-healing path.
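The fix described below retries initialization lazily on each poll instead of treating a failed startup as permanent. A minimal, self-contained sketch of that pattern (class and member names here are illustrative stand-ins; the real code lives in NatsStreamConsumer and uses Orleans/NATS.Net types):

```csharp
using System;
using System.Collections.Generic;

// Illustrative stand-in for NatsStreamConsumer's poll loop: instead of leaving
// the consumer null forever after a failed startup, each poll retries
// initialization until it succeeds (self-healing after transient errors).
public sealed class LazyRetryConsumer
{
    private object? _consumer;
    private readonly Func<object?> _initialize; // returns null on transient failure

    public LazyRetryConsumer(Func<object?> initialize) => _initialize = initialize;

    public IReadOnlyList<string> GetMessages()
    {
        // Lazy re-initialization: piggybacks on the existing poll cadence,
        // so no extra timer or background task is needed.
        _consumer ??= _initialize();
        if (_consumer is null)
            return Array.Empty<string>(); // still unavailable; the next poll retries

        return new[] { "message" }; // stand-in for the real JetStream fetch
    }
}
```

Once initialization succeeds on some later poll, message delivery resumes without any silo restart.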

2. Hardcoded R1 streams

NatsConnectionManager.Initialize() creates the JetStream stream without setting NumReplicas, defaulting to R1 (single replica). R1 streams have exactly one leader; any node restart makes the stream temporarily unavailable during leader election — which is the trigger for bug #1.

Combined effect: After a NATS rolling update, Orleans consumers enter a permanent error loop on every poll cycle, producing a flood of error logs and zero message delivery until the entire Orleans pod is restarted.

Root Cause

```
NATS node restart
  → R1 stream leader temporarily unavailable
    → NatsStreamConsumer.Initialize() throws (timeout / leader election)
      → _consumer stays null permanently
        → Every GetMessages() poll logs Error + returns empty
          → No recovery without full silo restart
```

Changes

| File | Change |
| --- | --- |
| NatsStreamConsumer.cs | When _consumer is null in GetMessages(), attempt lazy re-initialization before returning empty. Piggybacks on the existing Orleans pulling agent poll cadence. Log level changed from Error to Warning — transient retries during rolling updates are expected, not permanent failures. |
| NatsOptions.cs | Added NumReplicas property (default 1, backward compatible). Added validation in NatsStreamOptionsValidator ensuring NumReplicas >= 1. |
| NatsConnectionManager.cs | Passes NumReplicas to StreamConfig when creating JetStream streams. Handles NATS error code 10058 (stream exists with different config) by attempting an in-place UpdateStreamAsync, enabling replica count upgrades without manual stream deletion. |
| NatsOptionsTests.cs (new) | Unit tests for the validator (invalid/valid NumReplicas, missing/empty StreamName), a default value assertion, and an integration test verifying NumReplicas flows through to the JetStream stream config. |
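The 10058 handling can be sketched roughly as follows. This is an illustrative fragment, not the PR diff: it assumes NATS.Net's JetStream context and exception types (whose exact shape may vary by client version), and it needs a live JetStream connection, so it is not runnable standalone.

```csharp
// Sketch of the create-or-update flow (jsContext is an assumed NATS.Net
// JetStream context; error 10058 means the stream already exists with a
// different configuration, e.g. an old NumReplicas value).
try
{
    await jsContext.CreateStreamAsync(streamConfig, cancellationToken);
}
catch (NatsJSApiException ex) when (ex.Error.ErrCode == 10058)
{
    // Update the existing stream in place instead of forcing a manual
    // delete/recreate, so replica counts can be upgraded on deploy.
    await jsContext.UpdateStreamAsync(streamConfig, cancellationToken);
}
```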

Usage

```csharp
siloBuilder.AddNatsStreams("my-provider", options =>
{
    options.StreamName = "my-stream";
    options.NumReplicas = 3; // R3 for production HA clusters (≥ 3 NATS nodes)
});
```

Backward Compatibility

  • NumReplicas defaults to 1 — existing deployments are unaffected.
  • The retry in GetMessages() is purely additive — previously broken consumers now self-heal.
  • Error 10058 handling is additive — previously this was an unhandled exception that crashed the adapter factory.

Testing

  • All 50 existing NATS integration tests pass (stream, subscription multiplicity, client stream).
  • 10 new tests added: validator unit tests (no NATS required) + integration test verifying NumReplicas is applied to JetStream stream config.
  • The 2 pre-existing NatsAdapterTests.SendAndReceiveFromNats failures are unrelated (cache cursor requests SeqNum=0 but JetStream sequences start at 1).
  • R3 integration testing requires a multi-node NATS cluster and should be validated in CI with a 3-node docker-compose setup.

NSTA2 added 2 commits March 26, 2026 16:18
- NatsStreamConsumer.GetMessages() now lazily retries initialization when
  _consumer is null, making the consumer self-healing after transient
  JetStream failures (leader election, timeout, network blip).
  Log level changed from Error to Warning for transient retry attempts.

- Added NumReplicas property to NatsOptions (default 1, backward compatible).
  NatsConnectionManager now passes NumReplicas to StreamConfig.

- NatsConnectionManager handles NATS error code 10058 (stream exists with
  different config) by attempting an in-place UpdateStreamAsync, enabling
  replica count upgrades without manual stream deletion.

- Added NumReplicas validation (>= 1) to NatsStreamOptionsValidator.
- Unit tests for NatsStreamOptionsValidator: invalid NumReplicas (0, -1, -100)
  throws OrleansConfigurationException; valid values (1, 3, 5) pass.
- Unit tests for existing StreamName validation (null, whitespace).
- Default value assertion: NumReplicas defaults to 1.
- Integration test: verifies NumReplicas=1 is applied to JetStream StreamConfig.
- R3 testing noted as requiring a multi-node NATS cluster (CI-level concern).
NSTA2 commented Mar 26, 2026

@dotnet-policy-service agree company="Microsoft"

Copilot AI left a comment

Pull request overview

This PR improves resiliency of the Orleans NATS JetStream streaming provider in multi-node NATS clusters by (1) enabling consumer self-healing after transient JetStream errors and (2) allowing stream replica count configuration so rolling restarts don’t permanently break consumption.

Changes:

  • Add lazy re-initialization of NatsStreamConsumer when _consumer is null during polling.
  • Introduce NatsOptions.NumReplicas with validation and apply it when creating JetStream streams (plus attempt stream update on config mismatch).
  • Add tests covering NumReplicas defaults/validation and a JetStream config assertion.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| test/Extensions/Orleans.Streaming.NATS.Tests/NatsOptionsTests.cs | Adds unit tests for options validation and a JetStream-facing test related to replicas. |
| src/Orleans.Streaming.NATS/Providers/NatsStreamConsumer.cs | Adds retry-on-poll initialization logic and lowers severity of “not initialized” logging. |
| src/Orleans.Streaming.NATS/Providers/NatsConnectionManager.cs | Plumbs NumReplicas into stream creation and attempts UpdateStreamAsync on config mismatch. |
| src/Orleans.Streaming.NATS/NatsOptions.cs | Adds NumReplicas option and validates it is at least 1. |

Comment on lines +46 to +50
```csharp
this._logger.LogWarning(
    "NATS Consumer not initialized — attempting re-initialization. Provider: {Provider} | Stream: {Stream} | Partition: {Partition}.",
    provider, stream, partition);

await Initialize(cancellationToken);
```

Copilot AI Mar 31, 2026

GetMessages logs a warning on every poll cycle while the consumer is uninitialized. During a longer outage/misconfiguration this can still produce very high log volume (poll cadence is ~100ms). Consider adding a backoff/rate-limit (eg, log once then periodically, or only log after N consecutive failures), while still retrying initialization each cycle.
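One self-contained way to implement the suggested rate limit (the class name and threshold here are illustrative, not part of this PR):

```csharp
using System;

// Log the first failure immediately, then only every Nth consecutive failure,
// while the caller still retries initialization on every poll.
public sealed class FailureLogThrottle
{
    private readonly int _every;
    private int _consecutiveFailures;

    public FailureLogThrottle(int every) => _every = every;

    // Returns true when this failure occurrence should be logged.
    public bool ShouldLog()
    {
        var n = ++_consecutiveFailures;
        return n == 1 || n % _every == 0;
    }

    // Call after a successful poll so the next outage logs promptly again.
    public void Reset() => _consecutiveFailures = 0;
}
```

With `_every = 100` and a ~100 ms poll cadence, an extended outage logs roughly once every ten seconds instead of ten times per second.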

Comment on lines +55 to +62
```csharp
/// <summary>
/// The number of stream replicas in the NATS JetStream cluster.
/// Higher values improve availability during node restarts (R3 survives
/// single-node failures in a 3-node cluster). Must be an odd number
/// and cannot exceed the number of NATS nodes.
/// Defaults to 1. Set to 3 for production clusters with ≥ 3 nodes.
/// </summary>
public int NumReplicas { get; set; } = 1;
```

Copilot AI Mar 31, 2026

The XML docs state NumReplicas "must be an odd number" and "cannot exceed the number of NATS nodes", but the validator only enforces >= 1. Either relax the docs to match what is actually enforced, or add validation for the odd-number requirement (and consider how to handle/enforce the upper bound, if possible).
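If the odd-number requirement were enforced rather than relaxed, the check could look like this sketch (simplified to a plain ArgumentOutOfRangeException rather than the OrleansConfigurationException the PR's validator throws; the cluster-size upper bound is not known client-side, so it is left to the server):

```csharp
using System;

// Sketch of a validator matching the XML docs: NumReplicas must be >= 1 and
// odd (JetStream deployments conventionally use 1, 3, or 5 replicas).
public static class NumReplicasValidator
{
    public static void Validate(int numReplicas)
    {
        if (numReplicas < 1)
            throw new ArgumentOutOfRangeException(nameof(numReplicas),
                "NumReplicas must be at least 1.");
        if (numReplicas % 2 == 0)
            throw new ArgumentOutOfRangeException(nameof(numReplicas),
                "NumReplicas must be odd (e.g. 1, 3, 5).");
    }
}
```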

Comment on lines 111 to 115
```csharp
var streamConfig = new StreamConfig(this._options.StreamName, [$"{this._providerName}.>"])
{
    Retention = StreamConfigRetention.Workqueue,
    NumReplicas = this._options.NumReplicas,
    SubjectTransform = new SubjectTransform
```

Copilot AI Mar 31, 2026

streamConfig is constructed twice (CreateStreamAsync + UpdateStreamAsync) with the same fields. Consider extracting a private helper/local function to build the config once to avoid future drift (eg, if Retention/SubjectTransform changes later but only one path is updated).
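One possible shape of the suggested extraction (an illustrative fragment; StreamConfig and StreamConfigRetention are NATS.Net types, with values copied from the snippet above):

```csharp
// Build the stream config in one place so the CreateStreamAsync and
// UpdateStreamAsync paths can never drift apart.
private StreamConfig BuildStreamConfig() =>
    new StreamConfig(this._options.StreamName, [$"{this._providerName}.>"])
    {
        Retention = StreamConfigRetention.Workqueue,
        NumReplicas = this._options.NumReplicas,
        // SubjectTransform and any future fields live here only.
    };
```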

Comment on lines +100 to +104
```csharp
var streamConfig = new StreamConfig(streamName, [$"test-replicas-provider.>"])
{
    Retention = StreamConfigRetention.Workqueue,
    NumReplicas = 1
};
```

Copilot AI Mar 31, 2026

NumReplicas_IsAppliedToJetStreamConfig doesn't exercise the provider path which was changed (NatsOptions -> NatsConnectionManager -> StreamConfig). The test currently hard-codes NumReplicas = 1 directly on StreamConfig, so it will pass even if _options.NumReplicas is never applied by the provider. Consider updating this to initialize a NatsConnectionManager with options.NumReplicas and asserting the resulting stream info, or rename the test so it reflects what is actually being validated.
