feat(jobs): implement exponential backoff for unreachable servers #9184

Merged
andrasbacsai merged 2 commits into next from unreachable-server-backoff
Mar 31, 2026

@andrasbacsai

Summary

  • Add unreachable_count tracking to ServerCheckJob and ServerConnectionCheckJob: the count is incremented on timeouts, connection failures, and exceptions, and reset to 0 when the server becomes reachable
  • Implement exponential backoff in ServerManagerJob.shouldSkipDueToBackoff() that reduces check frequency based on consecutive failures: 0-2 failures check every cycle, 3-5 failures check about every 15 min, 6-11 failures about every 30 min, 12+ failures about every 60 min
  • Use server ID hash to distribute backoff checks across cycles, preventing thundering herd on recovery
  • Reduce ServerConnectionCheckJob timeout from 30s to 15s for faster failure detection
  • Add comprehensive unit tests covering backoff intervals, distribution, and unreachable count increments
  • Refactor imports to use TimeoutExceededException directly instead of fully qualified names
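
The tiered backoff and hash-based distribution described above can be illustrated with a minimal sketch. This is not the PR's code: it is written in Python for brevity, the function names are hypothetical, and it assumes a 5-minute manager cycle, a caller-supplied cycle counter, and CRC32 as the server ID hash, none of which are confirmed by this description. Only the failure thresholds and target intervals come from the summary.

```python
import zlib

# Tiers from the PR summary: (minimum consecutive failures, target interval in minutes).
# Ordered from the highest threshold down so the first match wins.
BACKOFF_TIERS = [(12, 60), (6, 30), (3, 15)]


def backoff_interval_minutes(unreachable_count: int) -> int:
    """Return the target check interval, or 0 when no backoff applies (0-2 failures)."""
    for min_failures, interval in BACKOFF_TIERS:
        if unreachable_count >= min_failures:
            return interval
    return 0


def should_skip_due_to_backoff(
    server_id: int,
    unreachable_count: int,
    cycle_index: int,
    cycle_minutes: int = 5,  # assumption: the real cycle length is not stated in the PR
) -> bool:
    """Decide whether this cycle's check should be skipped for the given server."""
    interval = backoff_interval_minutes(unreachable_count)
    if interval == 0:
        return False  # healthy or only briefly failing: check every cycle

    # Number of manager cycles that make up one backoff interval.
    cycles_per_check = max(1, interval // cycle_minutes)

    # Assumption: CRC32 of the server ID picks a stable slot within the interval,
    # so servers in the same tier are not all checked in the same cycle.
    slot = zlib.crc32(str(server_id).encode()) % cycles_per_check

    return cycle_index % cycles_per_check != slot
```

Under these assumptions, a server with 3 to 5 consecutive failures is checked in exactly one of every three cycles, and which of the three depends on its ID. That per-server offset is what spreads the checks out instead of letting every backed-off server fire in the same cycle.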

Reduce load on unreachable servers by implementing exponential backoff
during connectivity failures. Check frequency decreases based on
consecutive failure count:
  0-2: every cycle
  3-5: ~15 min intervals
  6-11: ~30 min intervals
  12+: ~60 min intervals

Uses server ID hash to distribute checks across cycles and prevent
thundering herd.

ServerCheckJob and ServerConnectionCheckJob increment unreachable_count
on failures. ServerManagerJob applies backoff logic before dispatching
checks. Includes comprehensive test coverage.
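
As a rough illustration of the counter lifecycle in the commit message, the sketch below increments a failure counter on each failed check and resets it on the first success. It is a hypothetical Python model, not the jobs' actual code: `ServerState` and `record_check_result` are invented names, and only the `unreachable_count` field name and the increment/reset behaviour come from the PR.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    # Consecutive failed connectivity checks; 0 means the last check succeeded.
    unreachable_count: int = 0


def record_check_result(state: ServerState, reachable: bool) -> int:
    """Update the counter after one check and return its new value."""
    if reachable:
        # Any successful check clears the backoff entirely.
        state.unreachable_count = 0
    else:
        # Timeouts, connection failures, and exceptions all count as one failure.
        state.unreachable_count += 1
    return state.unreachable_count
```

Because the counter resets to 0 rather than decaying, a recovered server returns to every-cycle checks immediately after its first successful check.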
@andrasbacsai andrasbacsai merged commit 83caaba into next Mar 31, 2026
3 checks passed
@andrasbacsai andrasbacsai deleted the unreachable-server-backoff branch March 31, 2026 14:47