[automatic failover] Introduce fast failover mode - a thread-sync-free approach #4223

atakavci · 2025-08-04T14:25:53Z

Closing this in favor of #4226.

This started as an idea to "failover immediately" while avoiding any thread-synchronization operation or overhead, and to introduce the constructs needed to manage waiting/blocked threads. While this is still a valid and viable option, we also started tinkering with a way to make the core components more flexible, and decided to put more effort and courage into that. #4226 presents the way we chose to proceed for a "fast failover".

This PR is SUPERSEDED by #4226

I decided to keep it open anyway, since I am uncertain whether this is the right moment to change the creational behaviour of central components.

This PR is based on the changes in the previous PR, #4207.
The changes here should also be reviewed in comparison with #4220.
This is a thread-sync-free approach (compared to #4220) for failing fast with ongoing command executions and connection initializations.
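To make the "failing fast without thread synchronization" idea concrete, here is a minimal sketch of a pool wrapper that tracks borrowed connections in a concurrent set and closes them all on a switch. `SketchConnection` and `SketchTrackingPool` are hypothetical names invented for illustration; they are not the Jedis classes from this PR, and the real implementation may differ substantially.

```java
import java.io.Closeable;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a client connection; not the Jedis Connection class.
class SketchConnection implements Closeable {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    boolean isClosed() {
        return closed.get();
    }

    @Override
    public void close() {
        closed.set(true);
    }
}

// Hypothetical pool wrapper that remembers every connection it hands out.
class SketchTrackingPool {
    // Concurrent set and atomic flag: no synchronized blocks or explicit locks.
    private final Set<SketchConnection> borrowed = ConcurrentHashMap.newKeySet();
    private final AtomicBoolean disconnected = new AtomicBoolean(false);

    SketchConnection borrow() {
        if (disconnected.get()) {
            throw new IllegalStateException("pool was force-disconnected");
        }
        SketchConnection c = new SketchConnection();
        borrowed.add(c);
        // Re-check: a switch may have started between the first check and add().
        if (disconnected.get()) {
            borrowed.remove(c);
            c.close();
            throw new IllegalStateException("pool was force-disconnected");
        }
        return c;
    }

    void giveBack(SketchConnection c) {
        borrowed.remove(c);
        c.close(); // assumption for brevity: returned connections are not reused
    }

    // Called on a cluster switch: closes every connection still in flight
    // and returns how many were closed.
    int forceDisconnect() {
        disconnected.set(true);
        int count = 0;
        for (SketchConnection c : borrowed) {
            if (borrowed.remove(c)) {
                c.close();
                count++;
            }
        }
        return count;
    }

    int activeCount() {
        return borrowed.size();
    }
}
```

In this sketch a thread blocked on a command would observe its connection being closed and fail immediately, rather than waiting on a lock held by the failover thread.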

Summary of the changes in this PR:

- Added fast failover feature - forcibly disconnects old cluster connections during a switch with the help of TrackingConnectionPool
- Added cluster switch event notifications - detailed event args with reason and endpoint info; added switch reason tracking, which categorizes failover triggers (circuit breaker, health check, forced)
- Cluster health validation when borrowing a cluster resource - throws an exception when getting a connection from an unhealthy cluster
- Enhanced cluster resource management - proper cleanup with ConnectionPool and HealthCheckStrategy
- Improved failover test coverage - parameterized tests with timing and thread safety validation
- Introduced InitializationTracker - tracks the list of connections during their construction phase
- Added builders for Connection and ConnectionFactory - help to set the InitializationTracker for connections
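As a rough illustration of the switch event notifications listed above, the sketch below models an event carrying a reason and endpoint info, plus a simple listener registry. All type and field names (`SketchSwitchReason`, `SketchSwitchEvent`, `SketchSwitchNotifier`) are assumptions made up for this example and are not taken from the PR's code.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical trigger categories, named after the three listed in the summary.
enum SketchSwitchReason { CIRCUIT_BREAKER, HEALTH_CHECK, FORCED }

// Hypothetical immutable event payload: why the switch happened and where traffic moved.
final class SketchSwitchEvent {
    final SketchSwitchReason reason;
    final String fromEndpoint;
    final String toEndpoint;

    SketchSwitchEvent(SketchSwitchReason reason, String fromEndpoint, String toEndpoint) {
        this.reason = reason;
        this.fromEndpoint = fromEndpoint;
        this.toEndpoint = toEndpoint;
    }
}

// Hypothetical registry that fans a switch event out to all registered listeners.
class SketchSwitchNotifier {
    // Copy-on-write list: listeners can be added while an event is being published.
    private final List<Consumer<SketchSwitchEvent>> listeners = new CopyOnWriteArrayList<>();

    void addListener(Consumer<SketchSwitchEvent> listener) {
        listeners.add(listener);
    }

    void publish(SketchSwitchEvent event) {
        for (Consumer<SketchSwitchEvent> listener : listeners) {
            listener.accept(event);
        }
    }
}
```

A caller could register a listener that logs the reason and both endpoints, which is enough to tell a circuit-breaker trip apart from a forced switch.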

The commits essential to this one are:

- HealthStatus manager with initial listener and registration logic - pluggable health checker strategy introduced, these are draft NoOpStrategy, EchoStrategy, LagAwareStrategy - fix failing tests impacted from weighted clusters

- add echo to CommandObjects and UnifiedJedis - improve StrategySupplier by accepting JedisClientConfig - adapt EchoStrategy to StrategySupplier. Now it handles the creation of connection by accepting endpoint and JedisClientConfig - make healthchecks disabled by default - drop NoOpStrategy - add unit & integration tests for health check

- clear redundant catch - replace failover options and drop failoveroptions class - remove forced_unhealthy from healthstatus - fix failback check - add disabled flag to cluster - update/fix related tests

Co-authored-by: Copilot <[email protected]>

- replace failback enabled with failbacksupported in client - fix formatting - set defaults

- fix failing tests

- introduce graceperiod - fix issue when CB is forced_open and gracePeriod is completed

… results during construction of provider - add HealthStatus.UNKNOWN as default for Cluster - handle status changes in order of events during initialization - add tests for status tracker and ordering of events - fix impacted unit & integration tests

- fix formatting

- downgrade logback version for slf4j compatibility - increase timeouts for faultInjector

…MultiClusterPooledConnectionProvider - add test for init and post init events - fix failing tests

- fix failing tests due to method name change

- fix broken echostrategy due to connection issue - make healthCheckStrategy closable and close on - adding fastfailover mode to config and provider - add local failover tests for total failover duration

…actory

- added builders to connection and connectionFactory - introduce initializationTracker to track list of connections during their construction.

Copilot

Pull Request Overview

This PR introduces a comprehensive fast failover mechanism for the Jedis Redis client, providing thread-sync-free cluster switching with enhanced health monitoring and automatic failback capabilities.

Key Changes:

Fast failover implementation - Forcibly disconnects old cluster connections during failover using TrackingConnectionPool for immediate traffic redirection
Enhanced health monitoring system - Comprehensive health check strategies with configurable intervals, grace periods, and automatic status tracking
Automatic failback mechanism - Periodic checks to return to higher-weighted healthy clusters with configurable intervals and grace periods
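The failback behaviour described above (return to a higher-weighted healthy cluster once its grace period has passed) can be sketched roughly as follows. `SketchCluster`, `SketchFailbackSelector`, and their fields are hypothetical names chosen for illustration; the provider in this PR may track health and grace periods quite differently.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical cluster descriptor: a weight, a health flag, and a grace-period deadline.
class SketchCluster {
    final String name;
    final float weight;
    volatile boolean healthy = true;
    // Assumption: set when traffic is moved away; the cluster is not eligible before this time.
    volatile long gracePeriodEndsAtMillis = 0L;

    SketchCluster(String name, float weight) {
        this.name = name;
        this.weight = weight;
    }

    boolean isCandidate(long nowMillis) {
        return healthy && nowMillis >= gracePeriodEndsAtMillis;
    }
}

// Hypothetical selector that a periodic failback check could call.
class SketchFailbackSelector {
    private final List<SketchCluster> clusters;

    SketchFailbackSelector(List<SketchCluster> clusters) {
        this.clusters = clusters;
    }

    // Highest-weighted cluster that is healthy and past its grace period, if any.
    Optional<SketchCluster> pick(long nowMillis) {
        return clusters.stream()
                .filter(c -> c.isCandidate(nowMillis))
                .max(Comparator.comparingDouble((SketchCluster c) -> c.weight));
    }

    // Switch only when an eligible cluster has a strictly higher weight than the active one.
    SketchCluster failbackCheck(SketchCluster active, long nowMillis) {
        return pick(nowMillis)
                .filter(c -> c.weight > active.weight)
                .orElse(active);
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic; a scheduled task could call `failbackCheck` at the configured interval with `System.currentTimeMillis()`.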

Reviewed Changes

Copilot reviewed 56 out of 58 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| MultiClusterPooledConnectionProvider.java | Core failover logic with health status management, weighted cluster selection, and periodic failback scheduling |
| TrackingConnectionPool.java | Connection pool wrapper that tracks active connections and enables forced disconnection during failover |
| mcf/*.java | Health check framework including status tracking, event management, and various health check strategies |
| Test files | Comprehensive test coverage for failover scenarios, health checks, and integration testing with toxiproxy |

src/main/java/redis/clients/jedis/mcf/RedisRestAPIHelper.java

src/main/java/redis/clients/jedis/mcf/LagAwareStrategy.java

src/main/java/redis/clients/jedis/providers/MultiClusterPooledConnectionProvider.java

src/main/java/redis/clients/jedis/mcf/TrackingConnectionPool.java

src/main/java/redis/clients/jedis/mcf/StatusTracker.java

- do not throw exception if failover already happening

atakavci · 2025-08-14T13:20:19Z

Closing this in favor of #4226.

This started as an idea to "failover immediately" while avoiding any thread-synchronization operation or overhead, and to introduce the constructs needed to manage waiting/blocked threads. While this is still a valid and viable option, we also started tinkering with a way to make the core components more flexible, and decided to put more effort and courage into that. #4226 presents the way we chose to proceed for a "fast failover".

atakavci and others added 28 commits June 27, 2025 19:13

- weighted cluster selection

8a9f876

- HealthStatus manager with initial listener and registration logic - pluggable health checker strategy introduced, these are draft NoOpStrategy, EchoStrategy, LagAwareStrategy - fix failing tests impacted from weighted clusters

- fix naming

df66b1e

clean up and mark override methods

13757f5

fix link in javadoc

ef5d83a

fix formatting

a15fc64

- fix double registered listeners in healthstatusmgr

cf38240

- clear redundant catch - replace failover options and drop failoveroptions class - remove forced_unhealthy from healthstatus - fix failback check - add disabled flag to cluster - update/fix related tests

Update src/main/java/redis/clients/jedis/mcf/EchoStrategy.java

c2fb34c

Co-authored-by: Copilot <[email protected]>

- add remove endpoints

ade866d

- replace cluster disabled with failbackCandidate

ca3378d

- replace failback enabled with failbacksupported in client - fix formatting - set defaults

- remove failback candidate

ddcec73

- fix failing tests

- fix remove logic

c1b6d5f

- fix failing tests

- periodic failback checks

ff16330

- introduce graceperiod - fix issue when CB is forced_open and gracePeriod is completed

- introduce forceActiveCluster by duration

975ab78

- fix formatting

- fix failing tests by waiting on clusters to get healthy

405101e

- fix failing scenario test

607c66d

- downgrade logback version for slf4j compatibility - increase timeouts for faultInjector

- addressing reviews and feedback

aaac8f7

- fix formatting

2ffffef

- fix formatting

e6e1121

- get rid of the queue and event ordering for healthstatus change in …

b8d4e87

…MultiClusterPooledConnectionProvider - add test for init and post init events - fix failing tests

- replace use of reflection with helper methods

1ae7219

- fix failing tests due to method name change

- introduce clusterSwitchEvent and drop clusterFailover post processor

397f437

- fix broken echostrategy due to connection issue - make healthCheckStrategy closable and close on - adding fastfailover mode to config and provider - add local failover tests for total failover duration

- introduce fastfailover using objectMaker injection into connectionF…

ab05e6c

…actory

- polish

de034f4

- cleanup

df3d555

- improve healthcheck thread visibility

3352260

- introduce TrackingConnectionPool with FailFastConnectionFactory

812979a

- added builders to connection and connectionFactory - introduce initializationTracker to track list of connections during their construction.

atakavci requested review from uglide and ggivo August 4, 2025 14:25

atakavci requested review from a-TODO-rov and Copilot August 4, 2025 14:25

atakavci self-assigned this Aug 4, 2025

atakavci added the feature label Aug 4, 2025

atakavci changed the base branch from master to feature/automatic-failover August 4, 2025 14:26

Copilot AI reviewed Aug 4, 2025


atakavci added 7 commits August 5, 2025 13:51

- return broken source as usual

0ad3bbe

- do not throw exception if failover already happening

- unblock waiting threads

13cc8db

- failover by closing the pool

1c2b549

- formatting

21a95a2

- check waiters and active/idle connections to force disconnect

984db94

- add builder to trackingconnectionpool

5350cfc

- fix failing tests due to mocked ctor for trackingConnectionPool

03ac208

atakavci mentioned this pull request Aug 8, 2025

[automatic failover] Introduce fast failover mode - a thread-sync-free approach with builders #4226

Merged

atakavci closed this Aug 14, 2025
