
fix: add backoff for rate-limited batch downloads in range sync #8924

Open
lodekeeper wants to merge 11 commits into ChainSafe:unstable from lodekeeper:fix/sync-rate-limit-backoff

Conversation


@lodekeeper lodekeeper commented Feb 17, 2026

Motivation

When peers respond with rate-limit errors during range sync, the retry logic immediately re-requests from the same peer, creating a death spiral:

  1. Batch download gets rate-limited
  2. downloadingError() fires → batch goes to AwaitingDownload
  3. triggerBatchDownloader() immediately retries → rate-limited again
  4. Repeat until MAX_BATCH_DOWNLOAD_ATTEMPTS (20) is exhausted
  5. With multiple batches, this amplifies across all peers → peers get scored down → disconnected → fewer peers → more load on remaining → spiral

This was observed on bal-devnet-2 where a Lodestar supernode lost all peers due to 1,396 rate-limited responses from Lighthouse, confirmed via debug logs. See STEEL Discord thread for full analysis.

Changes

1. Detect rate-limited errors distinctly (downloadByRange.ts)

  • isRateLimitRequestError() helper checks for RESP_RATE_LIMITED, REQUEST_RATE_LIMITED, REQUEST_SELF_RATE_LIMITED from the reqresp layer
  • Rate-limited responses are rethrown with their original RequestError code (not wrapped), so chain.ts can detect them directly via isRateLimitRequestError()
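A minimal sketch of what such a helper could look like. The three error codes are the ones named above; the `RequestError`/`RequestErrorCode` shapes here are stand-ins, not Lodestar's actual reqresp types, which carry more variants and context.

```typescript
// Stand-in types: Lodestar's real RequestError/RequestErrorCode live in the
// reqresp package and have more variants than shown here.
enum RequestErrorCode {
  RESP_RATE_LIMITED = "RESP_RATE_LIMITED",
  REQUEST_RATE_LIMITED = "REQUEST_RATE_LIMITED",
  REQUEST_SELF_RATE_LIMITED = "REQUEST_SELF_RATE_LIMITED",
  SERVER_ERROR = "SERVER_ERROR",
}

class RequestError extends Error {
  constructor(readonly type: {code: RequestErrorCode}) {
    super(type.code);
  }
}

const RATE_LIMIT_CODES = new Set<RequestErrorCode>([
  RequestErrorCode.RESP_RATE_LIMITED,
  RequestErrorCode.REQUEST_RATE_LIMITED,
  RequestErrorCode.REQUEST_SELF_RATE_LIMITED,
]);

/** True if the error is a reqresp rate-limit, which should not be penalized. */
function isRateLimitRequestError(e: unknown): boolean {
  return e instanceof RequestError && RATE_LIMIT_CODES.has(e.type.code);
}
```

Because the original `RequestError` is rethrown unwrapped, an `instanceof` check like this works directly in chain.ts without unwrapping a `DownloadByRangeError`.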

2. Track rate-limited peers separately (batch.ts)

  • New BatchStatus.RateLimited state and downloadingRateLimited() / endCoolDown() methods
  • Rate-limited peers tracked in rateLimitedPeers: PeerIdStr[] — included in getFailedPeers() so peerBalancer prefers alternative peers
  • Rate-limited attempts don't burn through MAX_BATCH_DOWNLOAD_ATTEMPTS
  • After MAX_RATE_LIMITED_RETRIES (3), falls through to regular error handling
  • Counter resets on successful download or non-rate-limit error
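A rough sketch of that tracking. The field and method names follow the description above, but the real `Batch` class carries much more state and enforces status transitions; this only illustrates the reset and union behavior.

```typescript
type PeerIdStr = string;

// Simplified sketch of the peer tracking described above; not Lodestar's
// actual Batch class.
class BatchSketch {
  private failedDownloadAttempts: PeerIdStr[] = [];
  private rateLimitedPeers: PeerIdStr[] = [];

  downloadingError(peer: PeerIdStr): void {
    this.failedDownloadAttempts.push(peer); // burns MAX_BATCH_DOWNLOAD_ATTEMPTS
    this.rateLimitedPeers = []; // non-rate-limit error resets the counter
  }

  downloadingRateLimited(peer: PeerIdStr): void {
    this.rateLimitedPeers.push(peer); // does NOT consume the download budget
  }

  downloadingSuccess(): void {
    this.rateLimitedPeers = []; // success also resets
  }

  /** peerBalancer sorts these peers last when picking a retry peer. */
  getFailedPeers(): PeerIdStr[] {
    return [...this.failedDownloadAttempts, ...this.rateLimitedPeers];
  }
}
```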

3. Alternative peer selection with backoff (chain.ts)

  • Rate-limited batches transition RateLimited → AwaitingDownload immediately (no inline sleep)
  • triggerBatchDownloader() picks up the batch on the next cycle and peerBalancer selects a non-rate-limited peer (rate-limited peers are sorted last via getFailedPeers())
  • No peer penalties for rate-limiting — the peer is healthy but throttling us
  • Avoids duplicate error logging and downloadingError() for rate-limited responses

4. New constants (constants.ts)

  • MAX_RATE_LIMITED_RETRIES = 3
  • RATE_LIMITED_INITIAL_DELAY_MS = 50
  • RATE_LIMITED_MAX_DELAY_MS = 200
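Combined, these constants produce a capped exponential backoff of 50, 100, then 200 ms. A sketch of the computation (the formula follows the review thread below the fold; the helper name here is made up):

```typescript
const MAX_RATE_LIMITED_RETRIES = 3;
const RATE_LIMITED_INITIAL_DELAY_MS = 50;
const RATE_LIMITED_MAX_DELAY_MS = 200;

/**
 * Backoff before the nth consecutive rate-limited retry (attempt starts at 1).
 * Doubles each attempt, capped at RATE_LIMITED_MAX_DELAY_MS.
 * With MAX_RATE_LIMITED_RETRIES = 3 the sequence is 50, 100, 200 ms.
 */
function rateLimitedDelayMs(attempt: number): number {
  return Math.min(RATE_LIMITED_INITIAL_DELAY_MS * 2 ** (attempt - 1), RATE_LIMITED_MAX_DELAY_MS);
}
```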

Tests

8 unit tests in batch.test.ts covering:

  • State transition on rate-limited download
  • Consecutive attempt counting and exponential delay
  • Fall-through to regular error after max retries
  • Combined failure tracking (rate-limit + regular)
  • Rate-limited peers included in getFailedPeers() for peerBalancer
  • Peer list cleared on success
  • Counter reset on regular error

All existing tests pass (28 total across sync/range test suite).

Closes #8033

🤖 Generated with AI assistance (Lodekeeper)

@lodekeeper lodekeeper requested a review from a team as a code owner February 17, 2026 19:20
@gemini-code-assist
Contributor

Summary of Changes

Hello @lodekeeper, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue in the range synchronization process where rate-limited responses from peers could lead to a "death spiral" of retries, peer disconnections, and network instability. The changes introduce a robust mechanism to specifically detect and handle rate-limiting errors by tracking them separately, applying an exponential backoff strategy, and avoiding punitive peer scoring. This ensures that the system can gracefully recover from temporary peer throttling without degrading overall network health or exhausting retry attempts prematurely.

Highlights

  • Rate-Limit Detection: Introduced a new DownloadByRangeErrorCode.RATE_LIMITED and a helper function isRateLimitRequestError() to distinctly identify rate-limiting responses, preventing them from being treated as generic request-response errors.
  • Separate Rate-Limit Tracking: Implemented a rateLimitedAttempts counter within the Batch class to track consecutive rate-limited download attempts independently, ensuring they do not consume the MAX_BATCH_DOWNLOAD_ATTEMPTS budget.
  • Exponential Backoff Strategy: Applied an exponential backoff mechanism (50ms → 100ms → 200ms) for rate-limited batches before retrying, and crucially, removed peer penalties for these events, recognizing the peer is throttling rather than failing.
  • New Configuration Constants: Defined MAX_RATE_LIMITED_RETRIES, RATE_LIMITED_INITIAL_DELAY_MS, and RATE_LIMITED_MAX_DELAY_MS to configure the new rate-limiting retry and backoff behavior.


Changelog
  • packages/beacon-node/src/sync/constants.ts
    • Added new constants MAX_RATE_LIMITED_RETRIES, RATE_LIMITED_INITIAL_DELAY_MS, and RATE_LIMITED_MAX_DELAY_MS to configure rate-limiting behavior.
  • packages/beacon-node/src/sync/range/batch.ts
    • Imported new rate-limiting constants.
    • Added a rateLimitedAttempts private property to track consecutive rate-limited failures.
    • Modified downloadingSuccess and downloadingError methods to reset the rateLimitedAttempts counter.
    • Introduced a new downloadingRateLimited method to handle rate-limited responses, incrementing its dedicated counter and falling through to regular error handling after MAX_RATE_LIMITED_RETRIES.
  • packages/beacon-node/src/sync/range/chain.ts
    • Imported new rate-limiting constants.
    • Refactored the error handling logic within triggerBatchDownloader to specifically detect RATE_LIMITED errors.
    • Implemented exponential backoff for rate-limited batches and prevented peer penalties for these errors.
  • packages/beacon-node/src/sync/utils/downloadByRange.ts
    • Imported RequestErrorCode for identifying rate-limit specific errors.
    • Added DownloadByRangeErrorCode.RATE_LIMITED to categorize rate-limiting responses.
    • Created isRateLimitRequestError helper function to check for various rate-limiting request error codes.
    • Updated the downloadByRange function to catch rate-limiting errors and re-throw them as DownloadByRangeErrorCode.RATE_LIMITED.
  • packages/beacon-node/test/unit/sync/range/batch.test.ts
    • Imported new rate-limiting constants.
    • Added six new unit tests to verify the behavior of downloadingRateLimited, including attempt counting, fall-through logic, combined failure tracking, and counter resets.
Activity
  • Six new unit tests were added to batch.test.ts to cover the new rate-limited download handling logic.
  • All existing tests (76 total across the sync/range test suite) passed successfully.
  • The pull request addresses and closes issue PeerDAS: rate limit when syncing #8033.
  • The pull request was generated with AI assistance.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an exponential backoff mechanism for rate-limited range sync downloads, which is a great improvement to prevent nodes from being penalized and disconnected due to temporary rate limits. The changes are well-structured across constants, batch management, chain synchronization logic, and error handling. The addition of specific tests for the new rate-limiting logic is also commendable. I've found one area for improvement regarding code duplication in batch.ts, which could be refactored for better maintainability. Overall, this is a solid contribution that addresses a real-world issue.

The batch was transitioning to AwaitingDownload before the backoff
sleep, allowing other triggerBatchDownloader() calls to pick it up
immediately during the delay — bypassing the intended backoff.

Now the batch stays in Downloading state while sleeping, and only
transitions to AwaitingDownload after the delay completes. This
prevents the race condition where concurrent batch completions could
re-trigger a rate-limited batch before its cooldown expires.
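The ordering this commit message describes can be sketched as follows. Names are simplified stand-ins, and later commits in this PR moved the sleep out of this path entirely; the point illustrated is only that the state transition happens after the cooldown, not before it.

```typescript
type BatchStatusSketch = "Downloading" | "AwaitingDownload";

// Sketch: the batch stays Downloading for the whole cooldown, so concurrent
// triggerBatchDownloader() calls (which only pick up AwaitingDownload
// batches) cannot re-request it early.
async function handleRateLimitedSketch(
  batch: {state: BatchStatusSketch},
  delayMs: number,
  sleep: (ms: number) => Promise<void>
): Promise<void> {
  await sleep(delayMs); // still Downloading here
  batch.state = "AwaitingDownload"; // only now eligible for retry
}
```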

nflaig commented Feb 17, 2026

@codex review

Per review feedback: when rate-limit retries are exhausted, call
downloadingError() directly instead of duplicating its logic.
@chatgpt-codex-connector

Codex Review: Didn't find any major issues. More of your lovely PRs please.



matthewkeil commented Feb 20, 2026

Very happy to have you back on GitHub @lodekeeper!!! I will try to take a peek at this over the weekend. If I don't get back to you by then, please feel free to ping me on Discord in the thread and remind me next week.


@lodekeeper lodekeeper left a comment


Lodekeeper review — 4 persona reviewers (🐛 bug hunter, 🔒 security, 🧙 wisdom, 🏗️ architect).

✅ No functional bugs found
⚠️ 2 Medium security findings (trust boundary)
💡 1 convergent design suggestion (wisdom + architect independently flagged same issue)

@lodekeeper
Contributor Author

Thanks Matthew! Appreciate the warm welcome back 🙏 No rush — take your time and ping me if you have any questions about the approach.


@matthewkeil matthewkeil left a comment


@lodekeeper there are a few structural things I would like to address. I do not like the mixing of concerns between the Batch and SyncChain classes. I think the rate-limiting management should live within the Batch, and the SyncChain should just handle the retry logic when the cooldown expires. Here are some pseudocode suggestions; this may not catch all edge cases, but it feels like a good start to get you going:

batch.ts

export type BatchState =
  | AwaitingDownloadState
  | {status: BatchStatus.Downloading; peer: PeerIdStr; blocks: IBlockInput[]}
  | {status: BatchStatus.RateLimited; blocks: IBlockInput[]}
  | DownloadSuccessState
  | {status: BatchStatus.Processing; blocks: IBlockInput[]; attempt: Attempt}
  | {status: BatchStatus.AwaitingValidation; blocks: IBlockInput[]; attempt: Attempt};

class Batch {
  /** Peers whose responses were rate-limited during this batch's download attempts. */
  readonly rateLimitedAttempts: PeerIdStr[] = [];

  downloadingRateLimited(peer: PeerIdStr): number {
    if (this.state.status !== BatchStatus.Downloading) {
      throw new BatchError(this.wrongStatusErrorType(BatchStatus.Downloading));
    }

    if (this.rateLimitedAttempts.length >= MAX_RATE_LIMITED_RETRIES) {
      this.downloadingError(peer);
      return 0;
    }

    this.rateLimitedAttempts.push(peer);

    this.state = {status: BatchStatus.RateLimited, blocks: this.state.blocks};

    const coolDown = Math.min(
      RATE_LIMITED_INITIAL_DELAY_MS * 2 ** (this.rateLimitedAttempts.length - 1),
      RATE_LIMITED_MAX_DELAY_MS
    );

    return coolDown;
  }

  endCoolDown(): void {
    if (this.state.status !== BatchStatus.RateLimited) {
      throw new BatchError(this.wrongStatusErrorType(BatchStatus.RateLimited));
    }

    this.state = {status: BatchStatus.AwaitingDownload, blocks: this.state.blocks};
  }
}

chain.ts

      // wrapError ensures to never call both batch success() and batch error()
      const res = await wrapError(this.downloadByRange(peer, batch, this.syncType));

      if (res.err) {
        // There are several known error cases where we want to take action on the peer
        const errCode = (res.err as LodestarError<{code: string}>).type?.code;
        this.metrics?.syncRange.downloadByRange.error.inc({client: peer.client, code: errCode ?? "UNKNOWN"});

        if (errCode === DownloadByRangeErrorCode.RATE_LIMITED) {
          const delayMs = batch.downloadingRateLimited(peer);
          this.logger.debug("Batch download rate limited, backing off", {
            id: this.logId,
            ...batch.getMetadata(),
            peer: prettyPrintPeerIdStr(peer.peerId),
            delayMs,
          });
          if (delayMs !== 0) {
            await new Promise((r) => setTimeout(r, delayMs));
            batch.endCoolDown();
          }
        }

- Add BatchStatus.RateLimited state for explicit cooldown tracking
- Move backoff delay computation into Batch.downloadingRateLimited()
- Add Batch.endCoolDown() for RateLimited -> AwaitingDownload transition
- Make rateLimitedAttempts private (encapsulation)
- Simplify SyncChain to only call batch methods (no more duplicate retry logic)
- Update batch ordering helpers to recognize RateLimited as pre-processing
- Count RateLimited batches in buffer to prevent over-requesting

🤖 Generated with AI assistance
@lodekeeper
Copy link
Contributor Author

Pushed the structural refactoring in 29fdc5c:

Summary of changes:

  • Added BatchStatus.RateLimited state for explicit cooldown tracking
  • Batch.downloadingRateLimited() now owns the backoff delay computation (returns delayMs, 0 = retries exhausted)
  • Added Batch.endCoolDown() for RateLimited → AwaitingDownload transition
  • rateLimitedAttempts is now private (matches encapsulation of failedDownloadAttempts)
  • SyncChain simplified to just: delayMs = batch.downloadingRateLimited(peer) → sleep → batch.endCoolDown()
  • Removed MAX_RATE_LIMITED_RETRIES, RATE_LIMITED_INITIAL_DELAY_MS, RATE_LIMITED_MAX_DELAY_MS imports from chain.ts
  • Updated batch ordering helpers + buffer count to recognize RateLimited as pre-processing state
  • All tests updated, lint + check-types pass

@matthewkeil let me know if this matches what you had in mind.


@matthewkeil matthewkeil left a comment


Getting closer. A couple nits

Per review feedback — use the helper function directly instead of comparing
against DownloadByRangeErrorCode.RATE_LIMITED.

🤖 Generated with AI assistance
@lodekeeper
Contributor Author

Re: else { if (...) vs else if — I think the current pattern is correct here. The inner if (this.syncType === RangeSyncType.Finalized) only gates finalized-specific penalties (MISMATCHED_ROOT_HEX, MISSING_BLOBS, etc.).

The code after that block runs for ALL non-rate-limited errors:

  • Second switch → penalties that apply to both sync types
  • logger.verbose(...) → log all download errors
  • batch.downloadingError() → track failure + throw after max attempts

If it were else if, head sync errors would skip all of that — no logging, no failure tracking, no penalties.

} else {
  if (finalized) { /* finalized-only penalties */ }
  switch { /* penalties for both sync types */ }
  logger.verbose(...)      // runs for all errors
  batch.downloadingError() // runs for all errors
}

This was the pre-existing pattern before my changes. Want me to add a clarifying comment to make it more obvious?

Let rate-limited reqresp errors propagate with their original error code
instead of wrapping them in a DownloadByRangeError with RATE_LIMITED.
This way chain.ts can detect them via isRateLimitRequestError() directly,
matching the simplified approach suggested in review.

Remove the now-unused RATE_LIMITED variant from DownloadByRangeErrorCode.
@matthewkeil
Member

if (finalized) { /* finalized-only penalties */ }
switch { /* penalties for both sync types */ }

Note that there is no overlap in the codes that trigger both.


Copilot AI left a comment


Pull request overview

Adds explicit handling for req/resp rate-limiting during range sync so batches don’t immediately re-request and spiral into repeated rate-limit failures and peer disconnections (issue #8033).

Changes:

  • Detect and propagate req/resp rate-limit error codes so range sync can treat them specially.
  • Introduce a RateLimited batch state with separate retry tracking and exponential backoff before retrying.
  • Add new constants and unit tests covering rate-limit state transitions, counters, and fall-through behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
packages/beacon-node/src/sync/utils/downloadByRange.ts Preserves req/resp rate-limit error codes and adds isRateLimitRequestError() helper.
packages/beacon-node/src/sync/range/batch.ts Adds BatchStatus.RateLimited, separate rate-limit retry counter, and cooldown transitions.
packages/beacon-node/src/sync/range/chain.ts Applies rate-limit backoff and avoids peer penalties/log duplication for rate-limited responses.
packages/beacon-node/src/sync/range/utils/batches.ts Updates batch status validation / processing selection to account for RateLimited.
packages/beacon-node/src/sync/constants.ts Adds rate-limit retry/backoff constants.
packages/beacon-node/test/unit/sync/range/batch.test.ts Adds unit tests for rate-limit handling and retry/backoff behavior.
packages/beacon-node/test/unit/sync/range/utils/batches.test.ts Updates status validation tests to include RateLimited.


- Changed rateLimitedAttempts counter to rateLimitedPeers list (PeerIdStr[])
- Include rate-limited peers in getFailedPeers() so peerBalancer prefers
  alternative peers when retrying rate-limited batches
- Removed inline sleep — batch transitions back to AwaitingDownload
  immediately after rate-limiting, letting triggerBatchDownloader select
  a different peer via the normal retry cycle
- Added tests for peer tracking and getFailedPeers integration
- Remove outdated alternative-peer phrasing from RATE_LIMITED_INITIAL_DELAY_MS doc
- Update rateLimitedPeers field comment to reflect reset on both success and non-rate-limit errors

@matthewkeil matthewkeil left a comment


Just a couple of small changes @lodekeeper . Almost there 🚀


@matthewkeil matthewkeil left a comment


@lodekeeper this needs to be fixed still

// a different peer. Rate-limited peers are tracked in getFailedPeers(),
// so peerBalancer will prefer alternative peers. If no alternative is available
// the backoff delay is applied before retrying with the same peer pool.
await sleep(delayMs);
Member


The delay needs to go into requestBatches I think. Likely the peer selection, with filtering of rate-limited peers, will happen in bestPeerToRetryBatch and/or idlePeerForBatch, but I leave that up to you to implement. You also need to figure out how to get the delay time into that context so that, if no other peers are available, the batch waits before proceeding. I'm guessing you do not want to return delayMs from downloadingRateLimited but instead store it in the Batch so that it's available to the peerBalancer.

Contributor Author


You’re right that this needs to be enforced at scheduling boundaries, but the effect is equivalent with the current layout: on rate-limit we now set BatchStatus.RateLimited, apply await sleep(delayMs) there, then batch.endCoolDown() transitions back to AwaitingDownload. requestBatches() only picks up that batch after it exits rate-limited state and peer selection still goes through bestPeerToRetryBatch/idlePeerForBatch, which already deprioritizes recent rateLimitedPeers from getFailedPeers.

Contributor Author


Implemented this in ded88f4423 by moving cooldown timing out of sendBatch and into the request/retry flow.

What changed

  • Batch now stores cooldown deadline (rateLimitCoolDownUntilMs) and exposes getRateLimitCoolDownRemainingMs(peer).
  • sendBatch no longer awaits sleep for rate-limit responses. It just marks rate-limited state + schedules a downloader wakeup.
  • requestBatches() now handles both AwaitingDownload and RateLimited batches:
    • asks peerBalancer for best retry peer,
    • if selected peer is still in cooldown, skips and reschedules wakeup,
    • if cooldown elapsed (or selected peer is not in rate-limited set), transitions with endCoolDown() and retries immediately.
  • peerBalancer now accepts RateLimited batches for peer selection, so alternative peers can be chosen without blocking the send path.
  • Added cleanup for cooldown timers on batch removal / chain end / chain removal.
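A sketch of the deadline-based cooldown described above. The names `rateLimitCoolDownUntilMs` and `getRateLimitCoolDownRemainingMs` come from this comment; the real implementation is per-peer and wired into peerBalancer, both of which are omitted here. Storing a wall-clock deadline instead of a delay lets requestBatches() check readiness at any time without holding a pending sleep.

```typescript
// Simplified sketch, not the actual Batch class: tracks only the cooldown
// deadline so callers can poll how long is left.
class CoolDownSketch {
  private rateLimitCoolDownUntilMs = 0;

  startCoolDown(delayMs: number, nowMs: number = Date.now()): void {
    this.rateLimitCoolDownUntilMs = nowMs + delayMs;
  }

  /** Milliseconds until the cooldown expires; 0 once it has elapsed. */
  getRateLimitCoolDownRemainingMs(nowMs: number = Date.now()): number {
    return Math.max(0, this.rateLimitCoolDownUntilMs - nowMs);
  }
}
```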

Validation run:

  • pnpm lint
  • pnpm test:unit .../sync/range/batch.test.ts .../sync/range/utils/peerBalancer.test.ts .../sync/range/chain.test.ts
  • pnpm check-types
