feat(jobs): implement exponential backoff for unreachable servers #9184
Merged
andrasbacsai merged 2 commits into next on Mar 31, 2026
Reduce load on unreachable servers by implementing exponential backoff during connectivity failures. Check frequency decreases based on the consecutive failure count:

- 0-2: every cycle
- 3-5: ~15 min intervals
- 6-11: ~30 min intervals
- 12+: ~60 min intervals

Uses a hash of the server ID to distribute checks across cycles and prevent a thundering herd. ServerCheckJob and ServerConnectionCheckJob increment unreachable_count on failures. ServerManagerJob applies the backoff logic before dispatching checks. Includes comprehensive test coverage.
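The tiering and hash-based staggering described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the code from this PR: the function names, the 5-minute cycle length, the CRC32 hash, and the modulo-offset scheme are all assumptions made for the example. Only the failure-count tiers and their approximate intervals come from the description.

```python
import zlib


def backoff_interval_minutes(unreachable_count: int) -> int:
    """Map consecutive failures to a target interval; 0 means every cycle."""
    if unreachable_count <= 2:
        return 0
    if unreachable_count <= 5:
        return 15
    if unreachable_count <= 11:
        return 30
    return 60


def should_skip(server_id: int, unreachable_count: int, cycle_index: int,
                cycle_minutes: int = 5) -> bool:
    """Decide whether to skip this cycle's check (hypothetical helper).

    cycle_minutes=5 is an assumed scheduler period, not taken from the PR.
    """
    interval = backoff_interval_minutes(unreachable_count)
    if interval == 0:
        return False  # healthy or recently failed: check every cycle
    # Number of cycles between checks for this tier.
    every_n = max(1, interval // cycle_minutes)
    # A stable per-server offset spreads servers across cycles, so servers
    # in the same tier do not all get checked in the same cycle.
    offset = zlib.crc32(str(server_id).encode()) % every_n
    return cycle_index % every_n != offset
```

Under these assumptions, a server with 4 consecutive failures is checked in one out of every three cycles, and which of the three depends only on its ID.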
Summary
- Add `unreachable_count` tracking to `ServerCheckJob` and `ServerConnectionCheckJob`: incremented on timeouts, connection failures, and exceptions; reset to 0 when the server becomes reachable
- Add `ServerManagerJob.shouldSkipDueToBackoff()`, which reduces check frequency based on consecutive failures: 0-2 failures run every cycle, 3-5 failures ~15 min, 6-11 failures ~30 min, 12+ failures ~60 min
- Reduce the `ServerConnectionCheckJob` timeout from 30s to 15s for faster failure detection
- Reference `TimeoutExceededException` directly instead of by its fully qualified name
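The counter behavior in the first bullet (increment on any failure, reset to 0 on success) might look like the sketch below. It is a minimal Python illustration under stated assumptions: `ServerState`, `record_check_result`, and the `check` callable are hypothetical names for this example and do not mirror the PR's job classes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ServerState:
    """Hypothetical stand-in for the server record holding the counter."""
    unreachable_count: int = 0


def record_check_result(state: ServerState, check: Callable[[], bool]) -> bool:
    """Run a connectivity check and update the consecutive-failure counter."""
    try:
        reachable = check()
    except Exception:
        # Timeouts and other errors are treated the same as an
        # unreachable result (TimeoutError is a subclass of Exception).
        reachable = False

    if reachable:
        state.unreachable_count = 0  # any success clears the backoff
    else:
        state.unreachable_count += 1
    return reachable
```

Because a single successful check resets the counter, a recovered server returns to being checked every cycle right away instead of stepping back down through the tiers.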