Internode connection issues during stemcell upgrade #376

Merged
ssunka merged 4 commits into develop from 1.8.2
Feb 12, 2026
@ssunka ssunka commented Jan 30, 2026

Fix internode connection resilience with retry logic and graceful degradation

This commit addresses issues where metric-store nodes would fail to start or
hang indefinitely when peer nodes were unavailable, improving cluster stability
and enabling graceful degradation.

Key Changes:

  • Enhanced Connection retry logic with configurable timeouts and max retries

    • Added WithMaxRetries, WithRetryDelay, WithConnectTimeout options
    • Connection.Connect() now returns error instead of blocking indefinitely
    • Added Connection.Client() to return client with error handling
    • Added Connection.IsConnected() to check connection state
  • Updated RemoteAppender to handle connection failures gracefully

    • Initial connection failure no longer blocks service startup
    • Writes automatically fall back to the handoff queue when the connection is unavailable
    • Connection attempts retry on each write operation
  • Modified WriteReplayer to support reconnectable clients

    • Added reconnectableClient interface and WithWriteReplayerReconnectableClient option
    • Replayer now calls connection.Client() which triggers reconnection if needed
    • Gracefully handles nil clients with proper error messages
  • Added comprehensive metrics for connection monitoring

    • metric_store_internode_connection_attempts_total - tracks all connection attempts
    • metric_store_internode_connection_failures_total - tracks failed connections
    • metric_store_internode_connection_successes_total - tracks successful connections
    • metric_store_internode_connection_state - current state (0=disconnected, 1=connected)
  • Added configurable internode connection parameters via BOSH properties

    • internode_max_retries (default: 10)
    • internode_retry_delay (default: 1s)
    • internode_connect_timeout (default: 30s)
  • Added comprehensive test coverage

    • connection_test.go - tests retry logic, timeouts, metrics tracking
    • remote_appender_test.go - tests resilience to connection failures
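
The retry behaviour listed under Key Changes can be pictured with a minimal Go sketch. It is not the code from src/pkg/leanstreams/connection.go: the option names WithMaxRetries, WithRetryDelay and WithConnectTimeout and the methods Connect() and IsConnected() come from the description above, but their signatures, the Dialer type, the failNTimes helper, the plain integer counters standing in for the metrics, and the choice to treat the retry limit as the total number of attempts are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Dialer is a hypothetical stand-in for whatever opens the real internode link.
type Dialer func(timeout time.Duration) error

// Connection is an illustrative model, not the actual leanstreams type.
type Connection struct {
	dial           Dialer
	maxRetries     int
	retryDelay     time.Duration
	connectTimeout time.Duration
	connected      bool
	// Plain counters standing in for the *_attempts_total, *_failures_total
	// and *_successes_total metrics named in the description.
	attempts, failures, successes int
}

// Option configures a Connection (functional-options pattern).
type Option func(*Connection)

func WithMaxRetries(n int) Option               { return func(c *Connection) { c.maxRetries = n } }
func WithRetryDelay(d time.Duration) Option     { return func(c *Connection) { c.retryDelay = d } }
func WithConnectTimeout(d time.Duration) Option { return func(c *Connection) { c.connectTimeout = d } }

// NewConnection applies the defaults quoted in the description (10, 1s, 30s),
// then any caller-supplied options.
func NewConnection(dial Dialer, opts ...Option) *Connection {
	c := &Connection{dial: dial, maxRetries: 10, retryDelay: time.Second, connectTimeout: 30 * time.Second}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

// Connect makes a bounded number of attempts and returns an error instead of
// blocking forever. Assumption: maxRetries is the total number of attempts.
func (c *Connection) Connect() error {
	limit := c.maxRetries
	if limit < 1 {
		limit = 1 // always try at least once
	}
	var lastErr error
	for i := 0; i < limit; i++ {
		c.attempts++
		if lastErr = c.dial(c.connectTimeout); lastErr == nil {
			c.successes++
			c.connected = true
			return nil
		}
		c.failures++
		if i < limit-1 {
			time.Sleep(c.retryDelay) // wait between attempts, not after the last one
		}
	}
	c.connected = false
	return fmt.Errorf("connect failed after %d attempts: %w", limit, lastErr)
}

// IsConnected reports the last known connection state.
func (c *Connection) IsConnected() bool { return c.connected }

// failNTimes is a demo helper: a Dialer that fails n times, then succeeds.
func failNTimes(n int) Dialer {
	calls := 0
	return func(time.Duration) error {
		calls++
		if calls <= n {
			return errors.New("peer unavailable")
		}
		return nil
	}
}
```

With this sketch, NewConnection(failNTimes(2), WithMaxRetries(5)) connects on the third attempt, while a dialer that never succeeds makes Connect() return an error once the limit is reached, so the caller can decide how to degrade.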

Benefits:

  • Metric-store nodes start successfully even when peer nodes are down
  • No data loss: writes go to the handoff queue when connections are unavailable
  • Automatic recovery when peer nodes come back online
  • Better observability through connection metrics
  • Configurable retry behavior for different deployment scenarios
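
The fallback and recovery behaviour behind these benefits can be sketched as follows. This is a simplified model rather than the code in remote_appender.go or write_replayer.go: RemoteAppender and the Client() accessor are named in the description, but the Client and clientSource interfaces, the in-memory handoff slice (standing in for the real handoff queue), the Replay method and the fakePeer type are hypothetical.

```go
package main

import "errors"

// Client writes a batch to a peer node (hypothetical interface).
type Client interface {
	Write(batch []string) error
}

// clientSource hands out a usable client or an error, modelled loosely on
// the Connection.Client() accessor mentioned in the description.
type clientSource interface {
	Client() (Client, error)
}

// RemoteAppender is an illustrative model, not the real type.
type RemoteAppender struct {
	conn    clientSource
	handoff [][]string // in-memory stand-in for the handoff queue
}

// Append tries the peer first and falls back to the handoff queue on any
// connection or write failure. It reports whether the batch was queued.
func (a *RemoteAppender) Append(batch []string) (queued bool) {
	client, err := a.conn.Client()
	if err == nil && client != nil {
		if err = client.Write(batch); err == nil {
			return false
		}
	}
	a.handoff = append(a.handoff, batch)
	return true
}

// Replay drains queued batches in order and stops at the first failure,
// leaving the remaining batches queued for a later attempt.
func (a *RemoteAppender) Replay() error {
	for len(a.handoff) > 0 {
		client, err := a.conn.Client()
		if err != nil {
			return err
		}
		if client == nil {
			return errors.New("replay: nil client")
		}
		if err := client.Write(a.handoff[0]); err != nil {
			return err
		}
		a.handoff = a.handoff[1:]
	}
	return nil
}

// fakePeer is a demo peer that can be switched up or down.
type fakePeer struct {
	up       bool
	received [][]string
}

func (p *fakePeer) Client() (Client, error) {
	if !p.up {
		return nil, errors.New("peer unavailable")
	}
	return p, nil
}

func (p *fakePeer) Write(batch []string) error {
	p.received = append(p.received, batch)
	return nil
}
```

In this model, a batch appended while the peer is down lands in the queue, and a later Replay() delivers it once the peer is reachable again. The queue here is unbounded and lives in memory, so it says nothing about how the real handoff queue persists or limits data.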

Files Modified:

  • src/pkg/leanstreams/connection.go - Enhanced retry logic
  • src/internal/storage/remote_appender.go - Graceful degradation
  • src/internal/handoff/write_replayer.go - Reconnectable client support
  • src/internal/storage/replicated_storage.go - Configuration plumbing
  • src/internal/metric-store/metric_store.go - Configuration options
  • src/cmd/metric-store/app/config.go - Environment variable support
  • src/cmd/metric-store/app/metric_store.go - Metrics registration
  • src/internal/metrics/metrics.go - New metric constants
  • jobs/metric-store/spec - BOSH property definitions
  • jobs/metric-store/templates/bpm.yml.erb - Environment variable injection
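
The path from BOSH property to Go configuration (spec, bpm.yml.erb, config.go) can be illustrated with a small sketch that reads the three settings from the environment and falls back to the documented defaults. The environment variable names INTERNODE_MAX_RETRIES, INTERNODE_RETRY_DELAY and INTERNODE_CONNECT_TIMEOUT, the InternodeConfig type and the loader functions are assumptions; the actual names and parsing in config.go may differ.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// InternodeConfig holds the three tunables listed in the description.
type InternodeConfig struct {
	MaxRetries     int
	RetryDelay     time.Duration
	ConnectTimeout time.Duration
}

// durationOr parses the variable named key, or returns def when it is unset.
func durationOr(lookup func(string) (string, bool), key string, def time.Duration) (time.Duration, error) {
	v, ok := lookup(key)
	if !ok {
		return def, nil
	}
	d, err := time.ParseDuration(v)
	if err != nil {
		return def, fmt.Errorf("%s: %w", key, err)
	}
	return d, nil
}

// LoadInternodeConfig starts from the defaults quoted in the description
// (10, 1s, 30s) and overrides them from lookup. Variable names are hypothetical.
func LoadInternodeConfig(lookup func(string) (string, bool)) (InternodeConfig, error) {
	cfg := InternodeConfig{MaxRetries: 10, RetryDelay: time.Second, ConnectTimeout: 30 * time.Second}
	var err error
	if v, ok := lookup("INTERNODE_MAX_RETRIES"); ok {
		if cfg.MaxRetries, err = strconv.Atoi(v); err != nil {
			return cfg, fmt.Errorf("INTERNODE_MAX_RETRIES: %w", err)
		}
	}
	if cfg.RetryDelay, err = durationOr(lookup, "INTERNODE_RETRY_DELAY", cfg.RetryDelay); err != nil {
		return cfg, err
	}
	if cfg.ConnectTimeout, err = durationOr(lookup, "INTERNODE_CONNECT_TIMEOUT", cfg.ConnectTimeout); err != nil {
		return cfg, err
	}
	return cfg, nil
}

// LoadInternodeConfigFromEnv reads the process environment.
func LoadInternodeConfigFromEnv() (InternodeConfig, error) {
	return LoadInternodeConfig(os.LookupEnv)
}
```

Passing the lookup function in, rather than calling os.LookupEnv directly, keeps the parsing testable without touching the process environment.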

Test Files:

  • src/pkg/leanstreams/connection_test.go - New tests for connection logic
  • src/internal/storage/remote_appender_test.go - New tests for resilience

Certificate Updates:

  • Replaced expired test certificates and CRLs

shrisha-c previously approved these changes Feb 10, 2026
shrisha-c previously approved these changes Feb 11, 2026
…radation

Enhanced connection handling to prevent service startup failures when peer nodes are unavailable. Added configurable retry timeouts, automatic fallback to handoff queues, and comprehensive connection metrics. Includes test coverage for connection resilience and remote appender behavior.
@ssunka ssunka merged commit e55f976 into develop Feb 12, 2026
3 checks passed

3 participants