Enhancement: clarify AutoClusterFailover failover semantics and consider explicit probe-threshold APIs #25326

@BewareMyPower

Description

Search before reporting

  • I searched in the issues and found nothing similar.

Motivation

The current AutoClusterFailoverBuilder exposes:

  • failoverDelay(long, TimeUnit)
  • switchBackDelay(long, TimeUnit)
  • checkInterval(long, TimeUnit)

It uses periodic checks plus timestamps (failedTimestamp / recoverTimestamp) to decide when to switch.

These configs are ambiguous because the implementation actually makes its decisions on periodic health probes: the behavior depends on checkInterval, and the delay values are effectively converted into consecutive probe counts. The real semantics are therefore hard to infer from the API alone.

For example, with the current API:

  • checkInterval = 400 ms
  • failoverDelay = 1000 ms

Users may expect failover after roughly 1000 ms of unavailability. In practice, the implementation only evaluates state on each probe, so the actual switching behavior depends on the probe cadence and on when the first failed observation is recorded. This becomes even harder to reason about when availability is unstable between checks.

A concrete example:

  • t0: currently using the primary cluster
  • t0 + 400 ms: the probe detects the primary is unavailable, so the implementation records the first failed timestamp
  • t0 + 800 ms: the probe still sees the primary unavailable, but only 400 ms have elapsed since the first failed observation
  • t0 + 1200 ms: the probe still sees the primary unavailable, but only 800 ms have elapsed since the first failed observation
  • t0 + 1600 ms: the probe still sees the primary unavailable, and now the elapsed time since the first failed observation exceeds 1000 ms, so failover is attempted

From an operator's point of view, the service remained unavailable for about 1600 ms before failover was triggered, even though failoverDelay was configured as 1000 ms.
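The timeline above can be reproduced with a small simulation. This is an illustrative sketch only, not the actual Pulsar implementation; the name failedTimestamp follows the description in this issue.

```java
// Hypothetical simulation of the timestamp-based decision described above.
// Not Pulsar's code: it only models "record the first failed observation,
// then switch once the elapsed time since it reaches failoverDelay".
class DelayBasedFailoverSim {
    /** Returns the probe time (ms after t0) at which failover is attempted,
     *  assuming the primary is continuously unavailable from t0 onward. */
    static long failoverTimeMs(long checkIntervalMs, long failoverDelayMs) {
        long failedTimestamp = -1; // -1: no failed observation recorded yet
        long now = 0;
        while (true) {
            now += checkIntervalMs;                    // next periodic probe
            if (failedTimestamp < 0) {
                failedTimestamp = now;                 // first failed observation
            } else if (now - failedTimestamp >= failoverDelayMs) {
                return now;                            // delay elapsed: switch
            }
        }
    }

    public static void main(String[] args) {
        // checkInterval = 400 ms, failoverDelay = 1000 ms
        System.out.println("failover attempted at t0 + "
                + failoverTimeMs(400, 1000) + " ms"); // t0 + 1600 ms, not 1000
    }
}
```

Running this with the values from the example prints a failover time of t0 + 1600 ms, matching the timeline: the configured 1000 ms delay is measured from the first failed probe, which itself lands one probe interval after t0.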

An unstable-network example is also harder to explain with delay-based configuration:

  • t0: primary becomes unavailable
  • t0 + 400 ms: probe sees primary unavailable
  • t0 + 800 ms: probe sees primary available again
  • t0 + 1200 ms: probe sees primary available
  • t0 + 1600 ms: probe sees primary unavailable again

The underlying issue is that the delay-based contract is harder to understand than the actual probe-based decision model. For instance, in the unstable sequence above it is not obvious from the API alone whether the successful probe at t0 + 800 ms resets the failed timestamp or whether the configured delay keeps accumulating.
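By contrast, a consecutive-probe counter is easy to state precisely: any successful probe resets the count. A minimal hypothetical sketch (not Pulsar's implementation) applied to the unstable sequence above:

```java
// Hypothetical consecutive-failure counter, not Pulsar's code.
// A single successful probe resets the count, which makes the
// unstable-network sequence above easy to reason about.
class ProbeCounter {
    private int consecutiveFailures = 0;

    /** Records one probe result; returns true when failover should trigger. */
    boolean recordProbe(boolean primaryUp, int failoverThreshold) {
        if (primaryUp) {
            consecutiveFailures = 0;  // any success resets the count
            return false;
        }
        consecutiveFailures++;
        return consecutiveFailures >= failoverThreshold;
    }

    public static void main(String[] args) {
        ProbeCounter c = new ProbeCounter();
        // Probe results from the unstable-network timeline (true = up):
        boolean[] probes = {false, true, true, false};
        for (boolean up : probes) {
            System.out.println(c.recordProbe(up, 3)); // count never reaches 3
        }
    }
}
```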

The switchBackDelay config has a similar issue.

Solution

Consider evolving the Java-facing AutoClusterFailover API to expose the configuration in terms of consecutive probe thresholds:

For example:

  • failoverThreshold: number of consecutive failed probes required before switching away from the current cluster
  • switchBackThreshold: number of consecutive successful probes to the primary required before switching back from a secondary

Builder APIs would become failoverThreshold(int threshold) and switchBackThreshold(int threshold).

This matches the real decision model used by health probes and removes the mental conversion from durations to probe counts.
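A hypothetical shape for such a builder, using the method names proposed in this issue (this is a sketch, not an existing Pulsar interface):

```java
import java.util.concurrent.TimeUnit;

// Hypothetical builder surface for the proposal in this issue.
// Method names come from the issue text; this is not Pulsar's actual API.
interface AutoClusterFailoverBuilder {
    /** Switch away after this many consecutive failed probes of the
     *  currently used cluster. */
    AutoClusterFailoverBuilder failoverThreshold(int threshold);

    /** Switch back after this many consecutive successful probes of the
     *  primary cluster. */
    AutoClusterFailoverBuilder switchBackThreshold(int threshold);

    /** The probe cadence remains an explicit, separate config. */
    AutoClusterFailoverBuilder checkInterval(long interval, TimeUnit unit);
}
```

With this shape, the probe cadence (checkInterval) and the decision thresholds are independent knobs, so there is no hidden conversion from wall-clock durations to probe counts.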

There is also precedent in Pulsar for a threshold-based failover API: SameAuthParamsLookupAutoClusterFailover already exposes failoverThreshold and recoverThreshold.

Note: I might introduce a new ServiceInfoProvider interface to replace the current ServiceUrlProvider interface, so a new API won't break anything.

Alternatives

Keep the existing delay-based APIs and document more precisely that:

  • delays are evaluated only on probe boundaries
  • failover and switch back are determined by periodic observations rather than continuous monitoring
  • the elapsed wall-clock time seen by users can overshoot the configured delay depending on checkInterval

This is workable and preserves full Java compatibility, but it still leaves the contract harder to understand.

Another option is to add threshold-based APIs in Java and C++ while keeping the existing delay-based APIs for backward compatibility. That is safer for adoption, but it adds redundant configuration and API complexity.
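If both styles coexist, the existing delay-based configs could be mapped onto probe counts, which is essentially the conversion the issue says the implementation already performs. A hypothetical mapping, assuming a delay rounds up to whole probe intervals:

```java
// Hypothetical conversion from the existing delay-based configs to
// consecutive-probe thresholds. The rounding rule (round up to whole
// probe intervals) is an assumption, not documented Pulsar behavior.
class DelayToThreshold {
    /** Number of consecutive probes needed to cover delayMs at the
     *  given probe cadence (ceiling division). */
    static int toThreshold(long delayMs, long checkIntervalMs) {
        return (int) ((delayMs + checkIntervalMs - 1) / checkIntervalMs);
    }

    public static void main(String[] args) {
        // failoverDelay = 1000 ms, checkInterval = 400 ms -> 3 probes
        System.out.println(toThreshold(1000, 400));
    }
}
```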

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Metadata

Labels

type/enhancement: The enhancements for the existing features or docs, e.g. reduce memory usage of the delayed messages
