-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Description
Search before reporting
- I searched in the issues and found nothing similar.
Motivation
The current AutoClusterFailoverBuilder exposes:
failoverDelay(long, TimeUnit)switchBackDelay(long, TimeUnit)checkInterval(long, TimeUnit)
It uses periodic checks plus timestamps (failedTimestamp / recoverTimestamp) to decide when to switch.
These configs are ambiguous when the implementation actually makes decisions based on periodic health probes.
The current behavior depends on checkInterval, and the delay values are effectively converted into consecutive probe counts. That makes the real semantics harder to understand from the API alone.
For example, with the current API:
checkInterval = 400 msfailoverDelay = 1000 ms
Users may expect failover after roughly 1000 ms of unavailability. In practice, the implementation only evaluates state on each probe, so the actual switching behavior depends on the probe cadence and on when the first failed observation is recorded. This becomes even harder to reason about when availability is unstable between checks.
A concrete example:
t0: currently using the primary clustert0 + 400 ms: the probe detects the primary is unavailable, so the implementation records the first failed timestampt0 + 800 ms: the probe still sees the primary unavailable, but only 400 ms have elapsed since the first failed observationt0 + 1200 ms: the probe still sees the primary unavailable, but only 800 ms have elapsed since the first failed observationt0 + 1600 ms: the probe still sees the primary unavailable, and now the elapsed time since the first failed observation exceeds 1000 ms, so failover is attempted
From an operator's point of view, the service remained unavailable for about 1600 ms before failover was triggered, even though failoverDelay was configured as 1000 ms.
An unstable-network example is also harder to explain with delay-based configuration:
t0: primary becomes unavailablet0 + 400 ms: probe sees primary unavailablet0 + 800 ms: probe sees primary available againt0 + 1200 ms: probe sees primary availablet0 + 1600 ms: probe sees primary unavailable again
The underlying issue is that the delay-based contract is harder to understand than the actual probe-based decision model.
The switchBackDelay config has the similar issue.
Solution
Consider evolving the Java-facing AutoClusterFailover API to expose the configuration in terms of consecutive probe thresholds:
For example:
failoverThreshold: number of consecutive failed probes required before switching away from the current clusterswitchBackThreshold: number of consecutive successful probes to the primary required before switching back from a secondary
Builder APIs would become: failoverThreshold(int threshold) and switchBackThreshold(int threshold)
This matches the real decision model used by health probes and removes the mental conversion from durations to probe counts.
There is also precedent in Pulsar for a threshold-based failover API: SameAuthParamsLookupAutoClusterFailover already exposes failoverThreshold and recoverThreshold.
Note: I might introduce a new ServiceInfoProvider interface to replace the current ServiceUrlProvider interface, so a new API won't break anything.
Alternatives
Keep the existing delay-based APIs and document more precisely that:
- delays are evaluated only on probe boundaries
- failover and switch back are determined by periodic observations rather than continuous monitoring
- the elapsed wall-clock time seen by users can overshoot the configured delay depending on
checkInterval
This is workable and preserves full Java compatibility, but it still leaves the contract harder to understand.
Another option is to add threshold-based APIs in Java and C++ while keeping the existing delay-based APIs for backward compatibility. That is safer for adoption, but it adds redundant configuration and API complexity.
Anything else?
No response
Are you willing to submit a PR?
- I'm willing to submit a PR!