[automatic failover] Introduce fast failover mode #4220

atakavci · 2025-07-31T07:40:00Z

Closing this in favor of #4226.

This PR explored a way to "fail over immediately" while avoiding any changes to core components in the library. That approach is still valid and viable, but we also started exploring ways to make the core components more flexible, and decided to invest the extra effort there. #4226 presents the approach we chose for "fast failover".

This PR is based on changes in previous #4207.

Summary of the changes in this PR:

- Added fast failover: forcibly disconnects connections to the old cluster during a switch, with the help of TrackingConnectionPool
- Added cluster switch event notifications: detailed event args with reason and endpoint info, plus switch reason tracking, which categorizes failover triggers (circuit breaker, health check, forced)
- Added cluster health validation when borrowing a cluster resource: throws an exception when a connection is requested from an unhealthy cluster
- Enhanced cluster resource management: proper cleanup of ConnectionPool and HealthCheckStrategy
- Improved failover test coverage: parameterized tests with timing and thread-safety validation
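
To make the first bullet concrete, here is a minimal sketch of a pool that tracks borrowed connections so they can all be closed during a switch. It is not the `TrackingConnectionPool` from this PR: `SketchConnection`, `TrackingPoolSketch`, `borrow`, `giveBack` and `forceDisconnect` are hypothetical names chosen for illustration, and the sketch closes a connection on return where a real pool would keep it idle.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical stand-in for a client connection; not the Jedis Connection class. */
class SketchConnection implements AutoCloseable {
    private volatile boolean open = true;

    boolean isOpen() { return open; }

    @Override
    public void close() { open = false; }
}

/** Hypothetical pool that remembers every borrowed connection so it can close them all at once. */
class TrackingPoolSketch {
    private final Supplier<SketchConnection> factory;
    private final Set<SketchConnection> borrowed = ConcurrentHashMap.newKeySet();
    private volatile boolean disconnected = false;

    TrackingPoolSketch(Supplier<SketchConnection> factory) {
        this.factory = factory;
    }

    /** Hands out a new connection and records it as in use. */
    SketchConnection borrow() {
        if (disconnected) {
            throw new IllegalStateException("pool was force-disconnected");
        }
        SketchConnection c = factory.get();
        borrowed.add(c);
        return c;
    }

    /** Stops tracking the connection; the sketch simply closes it instead of keeping it idle. */
    void giveBack(SketchConnection c) {
        borrowed.remove(c);
        c.close();
    }

    /** Closes every connection still in use and returns how many were closed. */
    int forceDisconnect() {
        disconnected = true;
        int closed = 0;
        for (SketchConnection c : borrowed) {
            if (c.isOpen()) {
                c.close();
                closed++;
            }
        }
        borrowed.clear();
        return closed;
    }
}
```

The point of tracking is that callers blocked on a connection to the old cluster see it closed immediately instead of waiting for a socket timeout. The sketch does not guard against a `borrow` racing with `forceDisconnect`; a production pool would need to handle that.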

The commits essential to this PR are:

- HealthStatus manager with initial listener and registration logic - pluggable health checker strategy introduced; the draft strategies are NoOpStrategy, EchoStrategy, LagAwareStrategy - fix failing tests impacted by weighted clusters

- add echo to CommandObjects and UnifiedJedis - improve StrategySupplier by accepting JedisClientConfig - adapt EchoStrategy to StrategySupplier; it now handles connection creation by accepting an endpoint and a JedisClientConfig - make health checks disabled by default - drop NoOpStrategy - add unit & integration tests for health check

- clear redundant catch - replace failover options and drop failoveroptions class - remove forced_unhealthy from healthstatus - fix failback check - add disabled flag to cluster - update/fix related tests

Co-authored-by: Copilot <[email protected]>

- replace failback enabled with failbacksupported in client - fix formatting - set defaults

- fix failing tests

- introduce graceperiod - fix issue when CB is forced_open and gracePeriod is completed

… results during construction of provider - add HealthStatus.UNKNOWN as default for Cluster - handle status changes in order of events during initialization - add tests for status tracker and ordering of events - fix impacted unit & integration tests

- fix formatting

- downgrade logback version for slf4j compatibility - increase timeouts for faultInjector

…MultiClusterPooledConnectionProvider - add test for init and post init events - fix failing tests

- fix failing tests due to method name change

- fix broken EchoStrategy due to connection issue - make healthCheckStrategy closable and close on - adding fastfailover mode to config and provider - add local failover tests for total failover duration

…actory

Copilot

Pull Request Overview

This PR introduces a fast failover mode for the multi-cluster Redis client by replacing the simple index-based cluster selection with a sophisticated weight-based health checking system. It adds comprehensive health monitoring capabilities, automatic failback mechanisms, and configurable grace periods to improve cluster availability and reliability.

Key Changes

Replaces index-based cluster management with weight-based endpoint selection using health checks
Introduces comprehensive health monitoring with configurable strategies and periodic status checks
Adds automatic failback mechanism with grace periods to prevent rapid switching between clusters
Implements fast failover mode that can forcibly disconnect active connections for quicker switching
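
The selection behaviour described in these bullets can be sketched as follows. This is an illustrative model under stated assumptions, not the code in `MultiClusterPooledConnectionProvider`: `SketchHealth`, `SketchCluster`, `WeightedSelectorSketch` and their methods are hypothetical, and the sketch assumes that the highest-weight healthy cluster wins and that a failed cluster is skipped until its grace period has elapsed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical health states; UNKNOWN is the state before the first check completes. */
enum SketchHealth { UNKNOWN, HEALTHY, UNHEALTHY }

/** Hypothetical cluster descriptor with a weight, a health state and a grace-period deadline. */
class SketchCluster {
    final String name;
    final float weight;
    volatile SketchHealth health = SketchHealth.UNKNOWN;
    volatile long gracePeriodEndsAtMillis = 0L;

    SketchCluster(String name, float weight) {
        this.name = name;
        this.weight = weight;
    }

    /** A cluster is usable only when it is healthy and its grace period is over. */
    boolean isUsable(long nowMillis) {
        return health == SketchHealth.HEALTHY && nowMillis >= gracePeriodEndsAtMillis;
    }
}

/** Hypothetical selector: picks the usable cluster with the highest weight. */
class WeightedSelectorSketch {
    private final List<SketchCluster> clusters = new ArrayList<>();

    void add(SketchCluster cluster) {
        clusters.add(cluster);
    }

    Optional<SketchCluster> select(long nowMillis) {
        return clusters.stream()
                .filter(c -> c.isUsable(nowMillis))
                .max(Comparator.comparingDouble((SketchCluster c) -> c.weight));
    }

    /** Marks a cluster as failed and keeps it out of selection for the grace period. */
    void markFailed(SketchCluster cluster, long nowMillis, long gracePeriodMillis) {
        cluster.health = SketchHealth.UNHEALTHY;
        cluster.gracePeriodEndsAtMillis = nowMillis + gracePeriodMillis;
    }
}
```

Passing the clock value in as a parameter keeps the sketch deterministic. The grace period is what prevents rapid switching: a cluster that recovers right after a failure is not selected again until the deadline passes.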

Reviewed Changes

Copilot reviewed 42 out of 43 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `MultiClusterPooledConnectionProvider.java` | Core refactoring from index-based to endpoint-based cluster management with health monitoring integration |
| `TrackingConnectionPool.java` | New connection pool implementation with tracking and force disconnect capabilities for fast failover |
| Health Check Infrastructure | Multiple new files implementing health status management, event-driven monitoring, and various health check strategies |
| Test Files | Comprehensive test coverage for new health checking, failback mechanisms, and integration scenarios |
| Configuration Updates | Updates to support new health check configuration, weights, grace periods, and failback settings |

Comments suppressed due to low confidence (4)

src/test/java/redis/clients/jedis/mcf/ActiveActiveLocalFailoverTest.java:144

[nitpick] The class name 'FailoverReporter' should be 'ClusterSwitchReporter' to better reflect that it handles all cluster switch events, not just failover events.

    class FailoverReporter implements Consumer<ClusterSwitchEventArgs> {

src/main/java/redis/clients/jedis/mcf/ClusterSwitchEventArgs.java:8

Field name 'ClusterName' should follow Java naming conventions and be 'clusterName' (camelCase).

    private final String ClusterName;

src/main/java/redis/clients/jedis/mcf/ClusterSwitchEventArgs.java:9

Field name 'Endpoint' should follow Java naming conventions and be 'endpoint' (camelCase).

    private final Endpoint Endpoint;
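
As an illustration of the two naming comments above, the following sketch shows an event object with camelCase fields, delivered to `Consumer` listeners in the same style as the `FailoverReporter` test class. It is not the PR's `ClusterSwitchEventArgs`: every type here is hypothetical, the endpoint is simplified to a `host:port` string, and the reason values are taken from the trigger categories listed in the PR description.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical switch reasons, mirroring the triggers named in the PR description. */
enum SwitchReasonSketch { CIRCUIT_BREAKER, HEALTH_CHECK, FORCED }

/** Hypothetical immutable event; fields use camelCase as the review comments suggest. */
final class ClusterSwitchEventSketch {
    private final SwitchReasonSketch reason;
    private final String clusterName;
    private final String endpoint; // simplified to "host:port" for this sketch

    ClusterSwitchEventSketch(SwitchReasonSketch reason, String clusterName, String endpoint) {
        this.reason = reason;
        this.clusterName = clusterName;
        this.endpoint = endpoint;
    }

    SwitchReasonSketch getReason() { return reason; }

    String getClusterName() { return clusterName; }

    String getEndpoint() { return endpoint; }
}

/** Hypothetical notifier that delivers each switch event to every registered listener. */
class SwitchNotifierSketch {
    private final List<Consumer<ClusterSwitchEventSketch>> listeners = new CopyOnWriteArrayList<>();

    void addListener(Consumer<ClusterSwitchEventSketch> listener) {
        listeners.add(listener);
    }

    void publish(ClusterSwitchEventSketch event) {
        for (Consumer<ClusterSwitchEventSketch> listener : listeners) {
            listener.accept(event);
        }
    }
}
```

A listener can then be registered as a lambda, for example `notifier.addListener(e -> System.out.println(e.getReason() + " -> " + e.getClusterName()))`.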

src/main/java/redis/clients/jedis/mcf/TrackingConnectionPool.java:47

[nitpick] The method name 'injector' is unclear. Consider renaming to 'wrapConnectionSupplier' or 'createConnectionWrapper' to better describe its purpose.

    private Supplier<Connection> injector(Supplier<Connection> supplier) {

src/main/java/redis/clients/jedis/providers/MultiClusterPooledConnectionProvider.java

src/test/java/redis/clients/jedis/mcf/ActiveActiveLocalFailoverTest.java

src/main/java/redis/clients/jedis/mcf/HealthCheck.java

atakavci · 2025-08-14T13:13:03Z

Closing this in favor of #4226.

This PR explored a way to "fail over immediately" while avoiding any changes to core components in the library. That approach is still valid and viable, but we also started exploring ways to make the core components more flexible, and decided to invest the extra effort there. #4226 presents the approach we chose for "fast failover".

atakavci and others added 25 commits June 27, 2025 19:13

- weighted cluster selection

8a9f876

- HealthStatus manager with initial listener and registration logic - pluggable health checker strategy introduced; the draft strategies are NoOpStrategy, EchoStrategy, LagAwareStrategy - fix failing tests impacted by weighted clusters

- fix naming

df66b1e

clean up and mark override methods

13757f5

fix link in javadoc

ef5d83a

fix formatting

a15fc64

- fix double-registered listeners in healthstatusmgr

cf38240

- clear redundant catch - replace failover options and drop failoveroptions class - remove forced_unhealthy from healthstatus - fix failback check - add disabled flag to cluster - update/fix related tests

Update src/main/java/redis/clients/jedis/mcf/EchoStrategy.java

c2fb34c

Co-authored-by: Copilot <[email protected]>

- add remove endpoints

ade866d

- replace cluster disabled with failbackCandidate

ca3378d

- replace failback enabled with failbacksupported in client - fix formatting - set defaults

- remove failback candidate

ddcec73

- fix failing tests

- fix remove logic

c1b6d5f

- fix failing tests

- periodic failback checks

ff16330

- introduce graceperiod - fix issue when CB is forced_open and gracePeriod is completed

- introduce forceActiveCluster by duration

975ab78

- fix formatting

- fix failing tests by waiting on clusters to get healthy

405101e

- fix failing scenario test

607c66d

- downgrade logback version for slf4j compatibility - increase timeouts for faultInjector

- addressing reviews and feedback

aaac8f7

- fix formatting

2ffffef

- fix formatting

e6e1121

- get rid of the queue and event ordering for healthstatus change in …

b8d4e87

…MultiClusterPooledConnectionProvider - add test for init and post init events - fix failing tests

- replace use of reflection with helper methods

1ae7219

- fix failing tests due to method name change

- introduce clusterSwitchEvent and drop clusterFailover post processor

397f437

- fix broken EchoStrategy due to connection issue - make healthCheckStrategy closable and close on - adding fastfailover mode to config and provider - add local failover tests for total failover duration

- introduce fastfailover using objectMaker injection into connectionF…

ab05e6c

…actory

- polish

de034f4

atakavci requested review from uglide, ggivo, a-TODO-rov and Copilot July 31, 2025 07:40

atakavci self-assigned this Jul 31, 2025

atakavci added the feature label Jul 31, 2025

Copilot AI reviewed Jul 31, 2025

atakavci added 2 commits July 31, 2025 13:53

- cleanup

df3d555

- improve healthcheck thread visibility

3352260

atakavci mentioned this pull request Aug 4, 2025

[automatic failover] Introduce fast failover mode - a thread-sync-free approach #4223

Closed

atakavci added 7 commits August 4, 2025 17:50

- fix threads waiting on ConnectionPool resources to return

74f024f

- formatting

db23079

- fix failing tests due to mocked ctor for trackingConnectionPool

b2ebe2d

- fix test , replace mock ctors for TrackingConnectionPool

11c4d2b

- make Tracking pool wait for ongoing inits in forceDisconnect

f4eae58

- fix failover test by checking time and endpoint

4c86919

- manage the case with failovers from multiple reasons.

9a1da64

atakavci mentioned this pull request Aug 8, 2025

[automatic failover] Introduce fast failover mode - a thread-sync-free approach with builders #4226

Merged

atakavci closed this Aug 14, 2025
