fix: add backoff for rate-limited batch downloads in range sync #8924

lodekeeper wants to merge 11 commits into ChainSafe:unstable
Conversation
When peers respond with rate-limit errors during range sync, the retry logic would immediately re-request from the same peer, creating a cascade that burns through all peers and causes a death spiral: rate-limited → immediate retry → rate-limited again → peer scored down → disconnected → fewer peers → more load on remaining peers.

This fix:

1. Detects rate-limited responses (`RESP_RATE_LIMITED`, `REQUEST_RATE_LIMITED`, `REQUEST_SELF_RATE_LIMITED`) as a distinct error type (`RATE_LIMITED`) in `downloadByRange`, separate from the generic `REQ_RESP_ERROR`.
2. Handles rate-limited batches with exponential backoff (50/100/200ms) instead of immediate retry, giving peers time to recover.
3. Tracks rate-limited attempts separately from regular download failures (does not burn through `MAX_BATCH_DOWNLOAD_ATTEMPTS` on rate limits).
4. After `MAX_RATE_LIMITED_RETRIES` (3), falls through to regular download error handling so the batch can try a completely different peer.
5. Does NOT penalize peers for rate-limiting (no `reportPeer` call), since the peer is healthy but throttling us; penalizing makes things worse.

Closes ChainSafe#8033

🤖 Generated with AI assistance (Lodekeeper)
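For illustration, the backoff schedule in point (2) can be sketched as follows. The constant names and the 50/100/200ms values mirror the description above, but this standalone snippet is a sketch, not the PR's actual code:

```typescript
// Values taken from the PR description; this is an illustrative sketch.
const RATE_LIMITED_INITIAL_DELAY_MS = 50;
const RATE_LIMITED_MAX_DELAY_MS = 200;

/** Backoff before the Nth rate-limited retry (1-indexed): 50, 100, 200ms, then capped. */
function rateLimitBackoffMs(attempt: number): number {
  return Math.min(RATE_LIMITED_INITIAL_DELAY_MS * 2 ** (attempt - 1), RATE_LIMITED_MAX_DELAY_MS);
}
```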
Summary of Changes

Hello @lodekeeper, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue in the range synchronization process where rate-limited responses from peers could lead to a "death spiral" of retries, peer disconnections, and network instability. The changes introduce a mechanism to detect and handle rate-limiting errors by tracking them separately, applying an exponential backoff strategy, and avoiding punitive peer scoring. This ensures the system can gracefully recover from temporary peer throttling without degrading overall network health or exhausting retry attempts prematurely.
Activity
Code Review
This pull request introduces an exponential backoff mechanism for rate-limited range sync downloads, which is a great improvement to prevent nodes from being penalized and disconnected due to temporary rate limits. The changes are well-structured across constants, batch management, chain synchronization logic, and error handling. The addition of specific tests for the new rate-limiting logic is also commendable. I've found one area for improvement regarding code duplication in batch.ts, which could be refactored for better maintainability. Overall, this is a solid contribution that addresses a real-world issue.
The batch was transitioning to AwaitingDownload before the backoff sleep, allowing other triggerBatchDownloader() calls to pick it up immediately during the delay — bypassing the intended backoff. Now the batch stays in Downloading state while sleeping, and only transitions to AwaitingDownload after the delay completes. This prevents the race condition where concurrent batch completions could re-trigger a rate-limited batch before its cooldown expires.
@codex review
Per review feedback: when rate-limit retries are exhausted, call downloadingError() directly instead of duplicating its logic.
Codex Review: Didn't find any major issues. More of your lovely PRs please.
Very happy to have you back on GitHub @lodekeeper!!! I will try to take a peek at this over the weekend. If I don't get back to you by then, please feel free to ping me on Discord in the thread and remind me next week.
lodekeeper
left a comment
Lodekeeper review — 4 persona reviewers (🐛 bug hunter, 🔒 security, 🧙 wisdom, 🏗️ architect).
✅ No functional bugs found
💡 1 convergent design suggestion (wisdom + architect independently flagged same issue)
Thanks Matthew! Appreciate the warm welcome back 🙏 No rush; take your time and ping me if you have any questions about the approach.
@lodekeeper there are a few structural things I would like to address. I do not like the mixing of concerns between the Batch and SyncChain classes. I think the rate-limiting management should live within the Batch, and the SyncChain should just handle the retry logic when the cooldown expires. Here are some pseudocode suggestions; they may not catch all edge cases, but they feel like a good start to get you going.
batch.ts

```typescript
export type BatchState =
  | AwaitingDownloadState
  | {status: BatchStatus.Downloading; peer: PeerIdStr; blocks: IBlockInput[]}
  | {status: BatchStatus.RateLimited; blocks: IBlockInput[]}
  | DownloadSuccessState
  | {status: BatchStatus.Processing; blocks: IBlockInput[]; attempt: Attempt}
  | {status: BatchStatus.AwaitingValidation; blocks: IBlockInput[]; attempt: Attempt};

class Batch {
  /** Peers that have rate-limited this batch's download requests. */
  readonly rateLimitedAttempts: PeerIdStr[] = [];

  downloadingRateLimited(peer: PeerIdStr): number {
    if (this.state.status !== BatchStatus.Downloading) {
      throw new BatchError(this.wrongStatusErrorType(BatchStatus.Downloading));
    }
    // Retries exhausted: fall through to the regular download-error path
    if (this.rateLimitedAttempts.length >= MAX_RATE_LIMITED_RETRIES) {
      this.downloadingError(peer);
      return 0;
    }
    this.rateLimitedAttempts.push(peer);
    this.state = {status: BatchStatus.RateLimited, blocks: this.state.blocks};
    // Exponential backoff, capped at RATE_LIMITED_MAX_DELAY_MS
    const coolDown = Math.min(
      RATE_LIMITED_INITIAL_DELAY_MS * 2 ** (this.rateLimitedAttempts.length - 1),
      RATE_LIMITED_MAX_DELAY_MS
    );
    return coolDown;
  }

  endCoolDown(): void {
    if (this.state.status !== BatchStatus.RateLimited) {
      throw new BatchError(this.wrongStatusErrorType(BatchStatus.RateLimited));
    }
    this.state = {status: BatchStatus.AwaitingDownload, blocks: this.state.blocks};
  }
}
```

chain.ts
```typescript
// wrapError ensures we never call both batch success() and batch error()
const res = await wrapError(this.downloadByRange(peer, batch, this.syncType));
if (res.err) {
  // There are several known error cases where we want to take action on the peer
  const errCode = (res.err as LodestarError<{code: string}>).type?.code;
  this.metrics?.syncRange.downloadByRange.error.inc({client: peer.client, code: errCode ?? "UNKNOWN"});
  if (errCode === DownloadByRangeErrorCode.RATE_LIMITED) {
    const delayMs = batch.downloadingRateLimited(peer);
    this.logger.debug("Batch download rate limited, backing off", {
      id: this.logId,
      ...batch.getMetadata(),
      peer: prettyPrintPeerIdStr(peer.peerId),
      delayMs,
    });
    if (delayMs !== 0) {
      await new Promise((r) => setTimeout(r, delayMs));
      batch.endCoolDown();
    }
  }
```

- Add BatchStatus.RateLimited state for explicit cooldown tracking
- Move backoff delay computation into Batch.downloadingRateLimited()
- Add Batch.endCoolDown() for RateLimited -> AwaitingDownload transition
- Make rateLimitedAttempts private (encapsulation)
- Simplify SyncChain to only call batch methods (no more duplicate retry logic)
- Update batch ordering helpers to recognize RateLimited as pre-processing
- Count RateLimited batches in buffer to prevent over-requesting

🤖 Generated with AI assistance
Pushed the structural refactoring in 29fdc5c: Summary of changes:
@matthewkeil let me know if this matches what you had in mind.
matthewkeil
left a comment
Getting closer. A couple nits
Per review feedback — use the helper function directly instead of comparing against DownloadByRangeErrorCode.RATE_LIMITED. 🤖 Generated with AI assistance
Re: "The code after that block runs for ALL non-rate-limited errors":

If it were This was the pre-existing pattern before my changes. Want me to add a clarifying comment to make it more obvious?
Let rate-limited reqresp errors propagate with their original error code instead of wrapping them in a DownloadByRangeError with RATE_LIMITED. This way chain.ts can detect them via isRateLimitRequestError() directly, matching the simplified approach suggested in review. Remove the now-unused RATE_LIMITED variant from DownloadByRangeErrorCode.
Note that there is no overlap in the codes that trigger both.
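Assuming the error codes named in the commit message above, the unwrapped-detection approach could be sketched as follows. The helper name `isRateLimitRequestError` comes from the PR, but this body is an illustration, not the actual `downloadByRange.ts` code:

```typescript
// Rate-limit codes listed in the PR description (assumed string values).
const RATE_LIMIT_CODES = new Set([
  "RESP_RATE_LIMITED",
  "REQUEST_RATE_LIMITED",
  "REQUEST_SELF_RATE_LIMITED",
]);

/** True when a req/resp error carries one of the rate-limit codes, unwrapped. */
function isRateLimitRequestError(err: {code?: string}): boolean {
  return err.code !== undefined && RATE_LIMIT_CODES.has(err.code);
}
```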
Pull request overview
Adds explicit handling for req/resp rate-limiting during range sync so batches don’t immediately re-request and spiral into repeated rate-limit failures and peer disconnections (issue #8033).
Changes:
- Detect and propagate req/resp rate-limit error codes so range sync can treat them specially.
- Introduce a `RateLimited` batch state with separate retry tracking and exponential backoff before retrying.
- Add new constants and unit tests covering rate-limit state transitions, counters, and fall-through behavior.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| packages/beacon-node/src/sync/utils/downloadByRange.ts | Preserves req/resp rate-limit error codes and adds isRateLimitRequestError() helper. |
| packages/beacon-node/src/sync/range/batch.ts | Adds BatchStatus.RateLimited, separate rate-limit retry counter, and cooldown transitions. |
| packages/beacon-node/src/sync/range/chain.ts | Applies rate-limit backoff and avoids peer penalties/log duplication for rate-limited responses. |
| packages/beacon-node/src/sync/range/utils/batches.ts | Updates batch status validation / processing selection to account for RateLimited. |
| packages/beacon-node/src/sync/constants.ts | Adds rate-limit retry/backoff constants. |
| packages/beacon-node/test/unit/sync/range/batch.test.ts | Adds unit tests for rate-limit handling and retry/backoff behavior. |
| packages/beacon-node/test/unit/sync/range/utils/batches.test.ts | Updates status validation tests to include RateLimited. |
- Changed `rateLimitedAttempts` counter to a `rateLimitedPeers` list (`PeerIdStr[]`)
- Include rate-limited peers in `getFailedPeers()` so peerBalancer prefers alternative peers when retrying rate-limited batches
- Removed inline sleep; the batch transitions back to `AwaitingDownload` immediately after rate-limiting, letting `triggerBatchDownloader` select a different peer via the normal retry cycle
- Added tests for peer tracking and `getFailedPeers` integration
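The `getFailedPeers()` integration described above might look like the following sketch; the class and field names here are illustrative stand-ins, not the real `Batch` implementation:

```typescript
type PeerIdStr = string;

// Hypothetical slice of Batch showing how rate-limited peers could be folded
// into the set of peers the balancer should deprioritize on retry.
class BatchPeers {
  private failedDownloadPeers: PeerIdStr[] = [];
  private rateLimitedPeers: PeerIdStr[] = [];

  markFailed(peer: PeerIdStr): void {
    this.failedDownloadPeers.push(peer);
  }

  markRateLimited(peer: PeerIdStr): void {
    this.rateLimitedPeers.push(peer);
  }

  /** Peers the balancer should sort last when retrying this batch. */
  getFailedPeers(): PeerIdStr[] {
    return [...new Set([...this.failedDownloadPeers, ...this.rateLimitedPeers])];
  }
}
```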
- Remove outdated alternative-peer phrasing from the `RATE_LIMITED_INITIAL_DELAY_MS` doc
- Update the `rateLimitedPeers` field comment to reflect reset on both success and non-rate-limit errors
matthewkeil
left a comment
Just a couple of small changes @lodekeeper . Almost there 🚀
matthewkeil
left a comment
@lodekeeper this needs to be fixed still
```typescript
// a different peer. Rate-limited peers are tracked in getFailedPeers(),
// so peerBalancer will prefer alternative peers. If no alternative is available
// the backoff delay is applied before retrying with the same peer pool.
await sleep(delayMs);
```
The delay needs to go into requestBatches, I think. Likely the peer selection, with filtering of rate-limited peers, will happen in bestPeerToRetryBatch and/or idlePeerForBatch, but I leave that up to you to implement. You also need to figure out how to get the delay time into that context, so that if no other peers are available the batch waits before proceeding. I'm guessing you do not want to return delayMs from downloadingRateLimited but instead store it in the Batch so that it's available to the peerBalancer.
You’re right that this needs to be enforced at scheduling boundaries, but the effect is equivalent with the current layout: on rate-limit we now set BatchStatus.RateLimited, apply await sleep(delayMs) there, then batch.endCoolDown() transitions back to AwaitingDownload. requestBatches() only picks up that batch after it exits rate-limited state and peer selection still goes through bestPeerToRetryBatch/idlePeerForBatch, which already deprioritizes recent rateLimitedPeers from getFailedPeers.
Implemented this in ded88f4423 by moving cooldown timing out of sendBatch and into the request/retry flow.
What changed

- `Batch` now stores the cooldown deadline (`rateLimitCoolDownUntilMs`) and exposes `getRateLimitCoolDownRemainingMs(peer)`.
- `sendBatch` no longer awaits a sleep for rate-limit responses; it just marks the rate-limited state and schedules a downloader wakeup.
- `requestBatches()` now handles both `AwaitingDownload` and `RateLimited` batches:
  - asks peerBalancer for the best retry peer,
  - if the selected peer is still in cooldown, skips and reschedules the wakeup,
  - if the cooldown has elapsed (or the selected peer is not in the rate-limited set), transitions with `endCoolDown()` and retries immediately.
- `peerBalancer` now accepts `RateLimited` batches for peer selection, so alternative peers can be chosen without blocking the send path.
- Added cleanup for cooldown timers on batch removal / chain end / chain removal.
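The deadline-based cooldown in the first bullet could be sketched as below. The field and method names follow the summary above, but this is an illustrative sketch (simplified to drop the per-peer argument the real `getRateLimitCoolDownRemainingMs(peer)` takes):

```typescript
// Hypothetical slice of Batch: record when the cooldown ends instead of
// sleeping inline, so the scheduler can ask how much time remains.
class BatchCooldown {
  private rateLimitCoolDownUntilMs = 0;

  startCoolDown(delayMs: number, nowMs: number = Date.now()): void {
    this.rateLimitCoolDownUntilMs = nowMs + delayMs;
  }

  /** Returns 0 once the cooldown has elapsed and the batch may be retried. */
  getRateLimitCoolDownRemainingMs(nowMs: number = Date.now()): number {
    return Math.max(0, this.rateLimitCoolDownUntilMs - nowMs);
  }
}
```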
Validation run:

- `pnpm lint`
- `pnpm test:unit .../sync/range/batch.test.ts .../sync/range/utils/peerBalancer.test.ts .../sync/range/chain.test.ts`
- `pnpm check-types`
Motivation
When peers respond with rate-limit errors during range sync, the retry logic immediately re-requests from the same peer, creating a death spiral:
- `downloadingError()` fires → batch goes to `AwaitingDownload`
- `triggerBatchDownloader()` immediately retries → rate-limited again
- `MAX_BATCH_DOWNLOAD_ATTEMPTS` (20) is exhausted

This was observed on `bal-devnet-2` where a Lodestar supernode lost all peers due to 1,396 rate-limited responses from Lighthouse, confirmed via debug logs. See the STEEL Discord thread for full analysis.

Changes
1. Detect rate-limited errors distinctly (`downloadByRange.ts`)
   - `isRateLimitRequestError()` helper checks for `RESP_RATE_LIMITED`, `REQUEST_RATE_LIMITED`, `REQUEST_SELF_RATE_LIMITED` from the reqresp layer
   - Rate-limited errors keep their original `RequestError` code (not wrapped), so `chain.ts` can detect them directly via `isRateLimitRequestError()`
2. Track rate-limited peers separately (`batch.ts`)
   - New `BatchStatus.RateLimited` state and `downloadingRateLimited()` / `endCoolDown()` methods
   - `rateLimitedPeers: PeerIdStr[]` is included in `getFailedPeers()` so `peerBalancer` prefers alternative peers
   - Rate-limited attempts do not count against `MAX_BATCH_DOWNLOAD_ATTEMPTS`
   - After `MAX_RATE_LIMITED_RETRIES` (3), falls through to regular error handling
3. Alternative peer selection with backoff (`chain.ts`)
   - The batch transitions `RateLimited → AwaitingDownload` immediately (no inline sleep)
   - `triggerBatchDownloader()` picks up the batch on the next cycle and `peerBalancer` selects a non-rate-limited peer (rate-limited peers are sorted last via `getFailedPeers()`)
   - No `downloadingError()` for rate-limited responses
4. New constants (`constants.ts`)
   - `MAX_RATE_LIMITED_RETRIES = 3`
   - `RATE_LIMITED_INITIAL_DELAY_MS = 50`
   - `RATE_LIMITED_MAX_DELAY_MS = 200`

Tests
8 unit tests in `batch.test.ts` covering rate-limit state transitions, counters, and `getFailedPeers()` for peerBalancer.

All existing tests pass (28 total across the sync/range test suite).
Closes #8033
🤖 Generated with AI assistance (Lodekeeper)