Skip to content

Conversation

atakavci
Copy link
Contributor

@atakavci atakavci commented Jul 31, 2025


Closing this in favor of #4226.

This has been an idea to "failover immediately" while trying to avoid any changes on core components in lib. While this is still valid and viable option, we also start tinkering around a way to enable the core components for more flexibility and decided to put more effort and courage to do so. #4226 presents the way we chose to proceed for a "fast failover".


This PR is based on changes in previous #4207.

Summary of the changes in PR;

  • Added fast failover feature - forcibly disconnects old cluster connections during switch via help of TrackingConnectionPool
  • Added cluster switch event notifications - detailed event args with reason and endpoint info, added switch reason tracking wihch categorizes failover triggers (circuit breaker, health check, forced)
  • Cluster health validation for borrowing cluster resource - throws exception when getting connection from unhealthy cluster
  • Enhanced cluster resource management - proper cleanup with ConnectionPool and HealthCheckStrategy
  • Improved failover test coverage - parameterized tests with timing and thread safety validation

Commits essential to this one are;

atakavci and others added 25 commits June 27, 2025 19:13
- Healtstatus manager with initial listener and registration logic
- pluggable health checker strategy  introduced,  these are draft NoOpStrategy, EchoStrategy, LagAwareStrategy,
- fix failing tests impacted from weighted clusters
- add echo ot CommandObjects and UnifiedJEdis
- improve StrategySupplier by accepting jedisclientconfig
- adapt EchoStrategy to StrategySupplier. Now it handles the creation of connection by accepting endpoint and JedisClientConfig
- make healthchecks disabled by default
- drop noOpStrategy
-  add unit&integration tests for health check
- clear redundant catch
- replace failover options and drop failoveroptions class
- remove forced_unhealthy from healthstatus
- fix failback check
- add disabled flag to cluster
- update/fix related tests
- replace failback enabled with failbacksupported in client
- fix formatting
- set defaults
- fix failing tests
- fix failing tests
- introduce graceperiod
- fix issue when CB is forced_open and gracePeriod is completed
… results during consturction of provider

- add HealthStatus.UNKNOWN as default for Cluster
- handle status changes in order of events during initialization
- add tests for status tracker and orderingof events
- fix impacted unit&integ tests
- downgrade logback version for slf4j compatibility
- increase timeouts for faultInjector
…MultiClusterPooledConnectionProvider

- add test for init and post init events
- fix failing tests
- fix failing tests due to method name change
- fix broken echostrategy due to connection issue
- make healtthCheckStrategy closable and close on
- adding fastfailover mode to config and provider
- add local failover tests for total failover duration
@atakavci atakavci self-assigned this Jul 31, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a fast failover mode for the multi-cluster Redis client by replacing the simple index-based cluster selection with a sophisticated weight-based health checking system. It adds comprehensive health monitoring capabilities, automatic failback mechanisms, and configurable grace periods to improve cluster availability and reliability.

Key Changes

  • Replaces index-based cluster management with weight-based endpoint selection using health checks
  • Introduces comprehensive health monitoring with configurable strategies and periodic status checks
  • Adds automatic failback mechanism with grace periods to prevent rapid switching between clusters
  • Implements fast failover mode that can forcibly disconnect active connections for quicker switching

Reviewed Changes

Copilot reviewed 42 out of 43 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
MultiClusterPooledConnectionProvider.java Core refactoring from index-based to endpoint-based cluster management with health monitoring integration
TrackingConnectionPool.java New connection pool implementation with tracking and force disconnect capabilities for fast failover
Health Check Infrastructure Multiple new files implementing health status management, event-driven monitoring, and various health check strategies
Test Files Comprehensive test coverage for new health checking, failback mechanisms, and integration scenarios
Configuration Updates Updates to support new health check configuration, weights, grace periods, and failback settings
Comments suppressed due to low confidence (4)

src/test/java/redis/clients/jedis/mcf/ActiveActiveLocalFailoverTest.java:144

  • [nitpick] The class name 'FailoverReporter' should be 'ClusterSwitchReporter' to better reflect that it handles all cluster switch events, not just failover events.
    class FailoverReporter implements Consumer<ClusterSwitchEventArgs> {

src/main/java/redis/clients/jedis/mcf/ClusterSwitchEventArgs.java:8

  • Field name 'ClusterName' should follow Java naming conventions and be 'clusterName' (camelCase).
    private final String ClusterName;

src/main/java/redis/clients/jedis/mcf/ClusterSwitchEventArgs.java:9

  • Field name 'Endpoint' should follow Java naming conventions and be 'endpoint' (camelCase).
    private final Endpoint Endpoint;

src/main/java/redis/clients/jedis/mcf/TrackingConnectionPool.java:47

  • [nitpick] The method name 'injector' is unclear. Consider renaming to 'wrapConnectionSupplier' or 'createConnectionWrapper' to better describe its purpose.
    private Supplier<Connection> injector(Supplier<Connection> supplier) {

@atakavci
Copy link
Contributor Author

atakavci commented Aug 14, 2025

Closing this in favor of #4226.

This has been an idea to "failover immediately" while trying to avoid any changes on core components in lib. While this is still valid and viable option, we also start tinkering around a way to enable the core components for more flexibility and decided to put more effort and courage to do so. #4226 presents the way we chose to proceed for a "fast failover".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants