
@atakavci (Contributor) commented Aug 4, 2025


Closing this in favor of #4226.

This was an idea to "failover immediately" without any thread-synchronization operations or overhead, and without introducing the constructs that would be needed to manage waiting/blocked threads. While this is still a valid and viable option, we also started exploring a way to make the core components more flexible, and decided to put more effort and courage into that. #4226 presents the approach we chose for a "fast failover".


This PR is SUPERSEDED by #4226

I decided to keep it open anyway, since I am uncertain whether this is the right moment to change the creational behaviour of central components.


This PR builds on the changes in #4207.
Changes here should also be reviewed in comparison with #4220.
This is a thread-sync-free approach (compared to #4220) for failing fast on ongoing command executions and connection initializations.
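To make "thread-sync-free" concrete, here is a minimal Java sketch under our own assumptions; none of these types are Jedis classes, and the names (`LockFreeSwitcherSketch`, `SwitchEventSketch`, `SwitchReasonSketch`) are hypothetical. The active endpoint lives in an `AtomicReference`, and a switch is a single compare-and-set: exactly one thread wins and publishes a switch event tagged with a reason (the circuit breaker / health check / forced categories listed in the summary below), while concurrent or stale callers get `false` back instead of waiting on a lock.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical switch reasons, mirroring the categories named in this PR.
enum SwitchReasonSketch { CIRCUIT_BREAKER, HEALTH_CHECK, FORCED }

// Hypothetical event payload: old endpoint, new endpoint, and the reason.
final class SwitchEventSketch {
    final String from;
    final String to;
    final SwitchReasonSketch reason;

    SwitchEventSketch(String from, String to, SwitchReasonSketch reason) {
        this.from = from;
        this.to = to;
        this.reason = reason;
    }
}

// Hypothetical lock-free holder of the active endpoint: no monitors, no waiting.
final class LockFreeSwitcherSketch {
    private final AtomicReference<String> active;
    private final List<Consumer<SwitchEventSketch>> listeners = new CopyOnWriteArrayList<>();

    LockFreeSwitcherSketch(String initial) {
        this.active = new AtomicReference<>(initial);
    }

    String active() {
        return active.get();
    }

    void onSwitch(Consumer<SwitchEventSketch> listener) {
        listeners.add(listener);
    }

    // compareAndSet compares references, so 'expected' should be the value
    // previously read via active(). Only the thread whose CAS succeeds fires
    // the event; concurrent or stale callers get false back immediately.
    boolean switchTo(String expected, String target, SwitchReasonSketch reason) {
        if (!active.compareAndSet(expected, target)) {
            return false;
        }
        SwitchEventSketch event = new SwitchEventSketch(expected, target, reason);
        for (Consumer<SwitchEventSketch> listener : listeners) {
            listener.accept(event);
        }
        return true;
    }
}
```

The design choice is that losing threads never block: they simply re-read `active()` and continue against whichever endpoint won.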

Summary of the changes in this PR:

  • Added fast failover feature - forcibly disconnects old cluster connections during a switch with the help of TrackingConnectionPool
  • Added cluster switch event notifications - detailed event args with reason and endpoint info; added switch reason tracking, which categorizes failover triggers (circuit breaker, health check, forced)
  • Cluster health validation when borrowing a cluster resource - throws an exception when getting a connection from an unhealthy cluster
  • Enhanced cluster resource management - proper cleanup with ConnectionPool and HealthCheckStrategy
  • Improved failover test coverage - parameterized tests with timing and thread-safety validation
  • Introduced InitializationTracker - tracks the list of connections during their construction phase
  • Added builders for Connection and ConnectionFactory - help set the InitializationTracker for connections
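The first and third bullets (forced disconnect through a tracking pool, and refusing to borrow from an unhealthy cluster) can be illustrated with a small Java sketch. `TrackingPoolSketch` and `ConnSketch` are hypothetical stand-ins, not the PR's `TrackingConnectionPool`; we assume the pool only needs a concurrent set of checked-out connections, and a real connection would close its socket rather than flip a flag.

```java
import java.util.Iterator;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a pooled Redis connection.
final class ConnSketch {
    private volatile boolean broken;

    boolean isBroken() {
        return broken;
    }

    // A real client would close the socket here, so any thread blocked
    // reading from it fails promptly instead of waiting for a timeout.
    void forceDisconnect() {
        broken = true;
    }
}

// Hypothetical pool that remembers every connection it has handed out.
final class TrackingPoolSketch {
    private final Set<ConnSketch> borrowed = ConcurrentHashMap.newKeySet();
    private volatile boolean healthy = true;

    void setHealthy(boolean healthy) {
        this.healthy = healthy;
    }

    ConnSketch borrow() {
        // Refuse to hand out resources from a cluster marked unhealthy.
        if (!healthy) {
            throw new IllegalStateException("cluster is not healthy");
        }
        ConnSketch c = new ConnSketch();
        borrowed.add(c);
        return c;
    }

    void giveBack(ConnSketch c) {
        borrowed.remove(c);
    }

    int inUse() {
        return borrowed.size();
    }

    // Called during failover: break every connection still checked out,
    // without waiting for the threads that hold them to return it.
    void forceDisconnectAll() {
        Iterator<ConnSketch> it = borrowed.iterator();
        while (it.hasNext()) {
            ConnSketch c = it.next();
            it.remove();
            c.forceDisconnect();
        }
    }
}
```

In a failover, the caller would first mark the old cluster unhealthy so no new borrows succeed, then call `forceDisconnectAll()` so threads already holding a connection fail fast.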

Commits essential to this one are:

atakavci and others added 28 commits June 27, 2025 19:13
- HealthStatus manager with initial listener and registration logic
- pluggable health check strategy introduced; draft strategies are NoOpStrategy, EchoStrategy, LagAwareStrategy
- fix failing tests impacted from weighted clusters
- add echo to CommandObjects and UnifiedJedis
- improve StrategySupplier by accepting JedisClientConfig
- adapt EchoStrategy to StrategySupplier; it now handles connection creation by accepting an endpoint and JedisClientConfig
- make healthchecks disabled by default
- drop noOpStrategy
- add unit & integration tests for health check
- clear redundant catch
- replace failover options and drop failoveroptions class
- remove forced_unhealthy from healthstatus
- fix failback check
- add disabled flag to cluster
- update/fix related tests
- replace failback enabled with failbacksupported in client
- fix formatting
- set defaults
- fix failing tests
- fix failing tests
- introduce graceperiod
- fix issue when CB is forced_open and gracePeriod is completed
… results during construction of provider

- add HealthStatus.UNKNOWN as default for Cluster
- handle status changes in order of events during initialization
- add tests for status tracker and ordering of events
- fix impacted unit&integ tests
- downgrade logback version for slf4j compatibility
- increase timeouts for faultInjector
…MultiClusterPooledConnectionProvider

- add test for init and post init events
- fix failing tests
- fix failing tests due to method name change
- fix broken echostrategy due to connection issue
- make healthCheckStrategy closable and close on
- adding fastfailover mode to config and provider
- add local failover tests for total failover duration
- added builders to connection and connectionFactory
- introduce initializationTracker to track the list of connections during their construction
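The last two commits (connection builders and the initialization tracker) address connections that are still being constructed when a failover happens. A hedged Java sketch of the idea, with hypothetical names (`InitTrackerSketch`, `TrackedConnSketch`) rather than the PR's classes: a connection registers itself before its handshake and unregisters afterwards, so a failover can close everything still mid-construction. Here the tracker is simply a constructor argument, standing in for the builder that the summary says sets it.

```java
import java.util.Iterator;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker for connections that are still being constructed.
final class InitTrackerSketch {
    private final Set<AutoCloseable> initializing = ConcurrentHashMap.newKeySet();

    void register(AutoCloseable c) {
        initializing.add(c);
    }

    void unregister(AutoCloseable c) {
        initializing.remove(c);
    }

    int pending() {
        return initializing.size();
    }

    // Invoked on failover: close everything still mid-construction so the
    // constructing threads fail fast instead of finishing a handshake
    // against the cluster being abandoned.
    void closeAll() {
        Iterator<AutoCloseable> it = initializing.iterator();
        while (it.hasNext()) {
            AutoCloseable c = it.next();
            it.remove();
            try {
                c.close();
            } catch (Exception ignored) {
                // best effort: one failing close must not stop the others
            }
        }
    }
}

// Hypothetical connection showing where the tracker hooks into construction.
final class TrackedConnSketch implements AutoCloseable {
    private volatile boolean closed;

    TrackedConnSketch(InitTrackerSketch tracker, Runnable handshake) {
        tracker.register(this);
        try {
            handshake.run(); // stands in for connect + HELLO/AUTH
        } finally {
            tracker.unregister(this); // no-op if closeAll() already removed it
        }
    }

    boolean isClosed() {
        return closed;
    }

    @Override
    public void close() {
        closed = true;
    }
}
```

In a real client, closing the socket mid-handshake would make the handshake itself throw; this sketch only records that the connection was closed.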
@atakavci atakavci requested review from uglide and ggivo August 4, 2025 14:25
@atakavci atakavci requested review from a-TODO-rov and Copilot August 4, 2025 14:25
@atakavci atakavci self-assigned this Aug 4, 2025
@atakavci atakavci changed the base branch from master to feature/automatic-failover August 4, 2025 14:26

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces a comprehensive fast failover mechanism for the Jedis Redis client, providing thread-sync-free cluster switching with enhanced health monitoring and automatic failback capabilities.

Key Changes:

  • Fast failover implementation - Forcibly disconnects old cluster connections during failover using TrackingConnectionPool for immediate traffic redirection
  • Enhanced health monitoring system - Comprehensive health check strategies with configurable intervals, grace periods, and automatic status tracking
  • Automatic failback mechanism - Periodic checks to return to higher-weighted healthy clusters with configurable intervals and grace periods
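The failback bullet can be read as a periodic selection rule. A minimal Java sketch, assuming each cluster exposes a weight, a health flag, and the time it last became healthy (all hypothetical fields in `ClusterViewSketch`, not the PR's API): pick the highest-weighted cluster that has stayed healthy for at least the grace period, and fail back only if it strictly outranks the active one.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical snapshot of one cluster as seen by a periodic failback check.
final class ClusterViewSketch {
    final String name;
    final double weight;
    final boolean healthy;
    final long healthySinceMillis; // when the cluster last turned healthy

    ClusterViewSketch(String name, double weight, boolean healthy, long healthySinceMillis) {
        this.name = name;
        this.weight = weight;
        this.healthy = healthy;
        this.healthySinceMillis = healthySinceMillis;
    }
}

final class FailbackSketch {
    // Highest-weighted cluster that is healthy and has stayed healthy for at
    // least gracePeriodMillis; empty if nothing qualifies.
    static Optional<ClusterViewSketch> pick(List<ClusterViewSketch> clusters,
                                            long nowMillis, long gracePeriodMillis) {
        return clusters.stream()
                .filter(c -> c.healthy)
                .filter(c -> nowMillis - c.healthySinceMillis >= gracePeriodMillis)
                .max(Comparator.comparingDouble((ClusterViewSketch c) -> c.weight));
    }

    // Fail back only when the best candidate strictly outranks the active cluster.
    static boolean shouldFailBack(ClusterViewSketch active, List<ClusterViewSketch> clusters,
                                  long nowMillis, long gracePeriodMillis) {
        return pick(clusters, nowMillis, gracePeriodMillis)
                .filter(c -> c.weight > active.weight)
                .isPresent();
    }
}
```

The grace period keeps a cluster that has only just recovered from being chosen, which avoids flapping back and forth between clusters.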

Reviewed Changes

Copilot reviewed 56 out of 58 changed files in this pull request and generated 6 comments.

Reviewed files (file - description):

  • MultiClusterPooledConnectionProvider.java - core failover logic with health status management, weighted cluster selection, and periodic failback scheduling
  • TrackingConnectionPool.java - connection pool wrapper that tracks active connections and enables forced disconnection during failover
  • mcf/*.java - health check framework including status tracking, event management, and various health check strategies
  • Test files - comprehensive test coverage for failover scenarios, health checks, and integration testing with Toxiproxy
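The status-tracking part of the health check framework, together with the commits that make `HealthStatus.UNKNOWN` the default and handle status changes in event order, suggests a tracker that never lets an older result overwrite a newer one. One plausible way to do that in Java, sketched with hypothetical names (`StatusTrackerSketch`, `HealthSketch`) and an assumed per-check sequence number; the PR does not say this is how it orders events.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical health states; UNKNOWN is the default before any check completes.
enum HealthSketch { UNKNOWN, HEALTHY, UNHEALTHY }

final class StatusTrackerSketch {
    // Immutable pair so status and sequence number are swapped atomically together.
    private static final class Snapshot {
        final long seq;
        final HealthSketch status;

        Snapshot(long seq, HealthSketch status) {
            this.seq = seq;
            this.status = status;
        }
    }

    private final AtomicReference<Snapshot> current =
            new AtomicReference<>(new Snapshot(-1L, HealthSketch.UNKNOWN));

    HealthSketch status() {
        return current.get().status;
    }

    // Applies a result tagged with the sequence number of the check that
    // produced it. Results not newer than the last applied one are dropped,
    // so late-arriving events cannot roll the status backwards.
    boolean apply(long seq, HealthSketch status) {
        Snapshot next = new Snapshot(seq, status);
        while (true) {
            Snapshot prev = current.get();
            if (seq <= prev.seq) {
                return false;
            }
            if (current.compareAndSet(prev, next)) {
                return true;
            }
        }
    }
}
```

The CAS loop keeps the tracker consistent with the thread-sync-free approach: concurrent reporters retry instead of blocking.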

@atakavci (Contributor, Author) commented:

Closing this in favor of #4226.


@atakavci atakavci closed this Aug 14, 2025