fix: remove backoff retry from coordinator metadata status operations #976

Draft
mattisonchao wants to merge 4 commits into main from fix/remove-metadata-status-backoff-retry

@mattisonchao
Member

Motivation

The coordinator status resource called backoff.NewExponentialBackOff() directly from the cenkalti library. That backoff has a default MaxElapsedTime of 15 minutes and is not context-aware. As a result, the retry loop ran with the mutex held for its entire duration, could not be cancelled on shutdown, and silently discarded the error once retries were exhausted. The callers already have their own retry mechanisms, so the retry inside the status resource was redundant and harmful.

Modification

  • Remove backoff retry from all StatusResource methods (Swap, Update, UpdateShardMetadata, DeleteShardMetadata, loadWithInitSlow)
  • Change StatusResource interface methods to return error so callers can handle failures appropriately
  • Update all callers in coordinator, shard controller, election, and split controller to handle errors
  • Update mock StatusResource in balancer test to match new interface

The status resource was using backoff.NewExponentialBackOff() from the
cenkalti library directly, which has a default MaxElapsedTime of 15
minutes and is not context-aware. This caused several issues:

- Held mutex lock during entire retry loop (up to 15 minutes)
- Not context-aware, so retries couldn't be cancelled on shutdown
- Errors silently discarded when retries exhausted
- Inconsistent with the rest of the codebase which uses oxiatime.NewBackOff(ctx)

The callers (shard controllers, elections, periodic tasks) already have
their own retry mechanisms, making the status resource retry redundant.

This change removes the backoff retry, returns errors to callers, and
lets them handle failures appropriately.

Signed-off-by: mattisonchao <mattisonchao@gmail.com>

- Check error returns from StatusResource methods in tests
- Simplify redundant if-return pattern in split_controller.go

Signed-off-by: mattisonchao <mattisonchao@gmail.com>
@mattisonchao mattisonchao force-pushed the fix/remove-metadata-status-backoff-retry branch from 218ed9d to d837be6 Compare March 25, 2026 05:42
Make ConfigChanged return an error so callers can handle swap failures
instead of looping forever when the metadata store is unavailable.
@mattisonchao mattisonchao self-assigned this Mar 25, 2026
@mattisonchao mattisonchao marked this pull request as draft March 25, 2026 16:32