
Replace linear retry backoff with full jitter exponential backoff #5440

Merged
ndr-ds merged 1 commit into main from
ndr-ds/replace_linear_retry_backoff_with_full_jitter_exponential_backoff
Feb 17, 2026

Conversation

@ndr-ds
Contributor

@ndr-ds ndr-ds commented Feb 13, 2026

Motivation

The retry logic in gRPC client requests, gRPC subscription reconnection, and cross-chain message forwarding all used linear backoff without jitter (`delay * retry_count`). This creates a thundering herd risk: when a validator goes down and comes back up, all clients that were retrying wake up at nearly the same time and hit the recovering validator simultaneously, potentially bringing it down again.

This happens because linear backoff is deterministic — every client on the same retry count sleeps the exact same duration, so their retries synchronize into bursts.

Proposal

Replace all three retry sites with Full Jitter exponential backoff, the industry-standard approach recommended by AWS, Google Cloud, and the gRPC spec.

The formula is: `sleep = random(0, min(cap, base * 2^attempt))`.

  • Exponential growth spaces retries further apart over time
  • Full randomization (jitter across the entire range, not just an additive offset) desynchronizes clients so they
    spread their retries evenly instead of clustering
  • A fixed 30s cap prevents excessive wait times (Google Cloud uses 30s, AWS uses 20s)
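The formula above can be sketched in plain Rust. This is an illustrative sketch, not the PR's code: the function name `full_jitter` and the inlined xorshift PRNG are assumptions made to keep the example dependency-free (the PR itself uses `rand::Rng`), though `MAX_SHIFT_EXPONENT` matches the constant added in the diff:

```rust
use std::time::Duration;

/// Largest valid shift for `u32` before overflow: `1u32 << 31`.
const MAX_SHIFT_EXPONENT: u32 = 31;

/// Full Jitter backoff: sleep = random(0, min(cap, base * 2^attempt)).
/// The RNG is injected as a closure so the sketch needs no external crates.
fn full_jitter(
    attempt: u32,
    base_delay: Duration,
    max_backoff: Duration,
    rng: &mut impl FnMut() -> u64,
) -> Duration {
    // Exponential component, saturating instead of overflowing.
    let exponential = base_delay.saturating_mul(1u32 << attempt.min(MAX_SHIFT_EXPONENT));
    let ceiling = exponential.min(max_backoff);
    let nanos = ceiling.as_nanos() as u64;
    if nanos == 0 {
        return Duration::ZERO;
    }
    // Uniform pick over the *entire* range [0, ceiling] — full jitter,
    // not just an additive offset on top of the exponential delay.
    Duration::from_nanos(rng() % (nanos + 1))
}

fn main() {
    // Tiny xorshift64 PRNG standing in for rand::Rng in this sketch.
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    let mut rng = move || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        state
    };
    let base = Duration::from_millis(100);
    let cap = Duration::from_secs(30);
    for attempt in 0..8 {
        let sleep = full_jitter(attempt, base, cap, &mut rng);
        let ceiling = base
            .saturating_mul(1u32 << attempt.min(MAX_SHIFT_EXPONENT))
            .min(cap);
        // The sampled sleep never exceeds the capped exponential ceiling.
        assert!(sleep <= ceiling);
        println!("attempt {attempt}: sleeping {sleep:?} (ceiling {ceiling:?})");
    }
}
```

Because the jitter spans the whole interval, two clients on the same attempt number almost never pick the same delay, which is what breaks up the synchronized bursts described above.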

Test Plan

  • CI

Release Plan

  • These changes should be backported to the latest `testnet` branch, then
    released in a validator hotfix.
Contributor Author

ndr-ds commented Feb 13, 2026

This stack of pull requests is managed by Graphite.

```rust
pub const KEY_PEM: &str = include_str!(concat!(env!("OUT_DIR"), "/private_key.pem"));

/// Maximum bit-shift exponent for `u32`: `1u32 << 31` is the largest valid shift before overflow.
const MAX_SHIFT_EXPONENT: u32 = 31;
```
Contributor

Please no hardcoded constants.

Contributor

This one seems quite sensible. I can't imagine us changing it 👀

Contributor

It's a bit more than 32 - 1.

@ndr-ds ndr-ds force-pushed the ndr-ds/replace_linear_retry_backoff_with_full_jitter_exponential_backoff branch 2 times, most recently from 2cbebb4 to 8244a6e on February 16, 2026 18:23
@ndr-ds ndr-ds requested review from Twey, afck, deuszx and ma2bd February 16, 2026 18:23
```rust
    max_backoff: std::time::Duration,
) -> std::time::Duration {
    use rand::Rng as _;
    let exponential_delay = base_delay.saturating_mul(1u32 << attempt.min(MAX_SHIFT_EXPONENT));
```
Contributor

We could use `1u32.overflowing_shl(attempt).unwrap_or(u32::MAX)`. Then we don't need the constant.

Or even `2u32.saturating_pow(attempt)`? I'm sure the compiler will realize it's a power of 2!

Contributor Author

I had Claude create some test code and generate the assembly for it, and surprisingly the compiler (at least on edition 2021) doesn't optimize `2u32.saturating_pow(attempt)` to a shift; it generates a long branching sequence.
TIL about `1u32.checked_shl(attempt).unwrap_or(u32::MAX)`, though, which does literally the same thing I was doing by hand, ha, and compiles to the same 6-instruction sequence as the original shift, so we should be good.
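To make the trade-off discussed in this thread concrete, here is a small standalone comparison (not code from the PR) of the manual clamp against `checked_shl`:

```rust
fn main() {
    // Manual clamp used in the PR: the shift amount is capped at 31.
    let manual = |attempt: u32| 1u32 << attempt.min(31);
    // Alternative raised in review: checked_shl returns None for shifts >= 32.
    let checked = |attempt: u32| 1u32.checked_shl(attempt).unwrap_or(u32::MAX);

    // Identical for every valid shift amount.
    for attempt in 0..32 {
        assert_eq!(manual(attempt), checked(attempt));
    }

    // They diverge only past the last valid shift: the manual clamp sticks at
    // 2^31 while checked_shl saturates to u32::MAX. Either way the result is
    // then tamed by saturating_mul and the 30s cap, so the backoff behaves
    // the same in practice.
    assert_eq!(manual(40), 1u32 << 31);
    assert_eq!(checked(40), u32::MAX);

    println!("both strategies agree on all valid shifts");
}
```

The divergence above 31 is harmless here precisely because the multiplier feeds into `Duration::saturating_mul` and is then clamped by the 30s cap.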

@ndr-ds ndr-ds force-pushed the ndr-ds/replace_linear_retry_backoff_with_full_jitter_exponential_backoff branch from 8244a6e to 6ed001a on February 17, 2026 13:54
@ndr-ds ndr-ds added this pull request to the merge queue Feb 17, 2026
Merged via the queue into main with commit c11027b Feb 17, 2026
35 checks passed
@ndr-ds ndr-ds deleted the ndr-ds/replace_linear_retry_backoff_with_full_jitter_exponential_backoff branch February 17, 2026 15:29
ndr-ds added a commit that referenced this pull request Feb 17, 2026
ndr-ds added a commit that referenced this pull request Feb 17, 2026