
Replace linear retry backoff with full jitter exponential backoff #5440

Merged
ndr-ds merged 1 commit into main from
ndr-ds/replace_linear_retry_backoff_with_full_jitter_exponential_backoff
Feb 17, 2026

Conversation

@ndr-ds
Contributor

@ndr-ds ndr-ds commented Feb 13, 2026

Motivation

The retry logic in gRPC client requests, gRPC subscription reconnection, and cross-chain message forwarding all used linear backoff without jitter (`delay * retry_count`). This creates a thundering herd risk: when a validator goes down and comes back up, all clients that were retrying wake up at nearly the same time and hit the recovering validator simultaneously, potentially bringing it down again.

This happens because linear backoff is deterministic — every client on the same retry count sleeps the exact same duration, so their retries synchronize into bursts.

Proposal

Replace all three retry sites with Full Jitter exponential backoff, the industry-standard approach recommended by AWS, Google Cloud, and the gRPC spec.

The formula is: `sleep = random(0, min(cap, base * 2^attempt))`.

  • Exponential growth spaces retries further apart over time
  • Full randomization (jitter across the entire range, not just an additive offset) desynchronizes clients so they
    spread their retries evenly instead of clustering
  • A fixed 30s cap prevents excessive wait times (Google Cloud uses 30s, AWS uses 20s)
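The formula above can be sketched in plain Rust. This is an illustrative sketch, not the PR's code: the function name `full_jitter` and the inlined xorshift PRNG are assumptions made to keep the example dependency-free (the PR itself uses `rand::Rng`), though `MAX_SHIFT_EXPONENT` matches the constant added in the diff:

```rust
use std::time::Duration;

/// Largest valid shift for `u32` before overflow: `1u32 << 31`.
const MAX_SHIFT_EXPONENT: u32 = 31;

/// Full Jitter backoff: sleep = random(0, min(cap, base * 2^attempt)).
/// The RNG is injected as a closure so the sketch needs no external crates.
fn full_jitter(
    attempt: u32,
    base_delay: Duration,
    max_backoff: Duration,
    rng: &mut impl FnMut() -> u64,
) -> Duration {
    // Exponential component, saturating instead of overflowing.
    let exponential = base_delay.saturating_mul(1u32 << attempt.min(MAX_SHIFT_EXPONENT));
    let ceiling = exponential.min(max_backoff);
    let nanos = ceiling.as_nanos() as u64;
    if nanos == 0 {
        return Duration::ZERO;
    }
    // Uniform pick over the *entire* range [0, ceiling] — full jitter,
    // not just an additive offset on top of the exponential delay.
    Duration::from_nanos(rng() % (nanos + 1))
}

fn main() {
    // Tiny xorshift64 PRNG standing in for rand::Rng in this sketch.
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    let mut rng = move || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        state
    };
    let base = Duration::from_millis(100);
    let cap = Duration::from_secs(30);
    for attempt in 0..8 {
        let sleep = full_jitter(attempt, base, cap, &mut rng);
        let ceiling = base
            .saturating_mul(1u32 << attempt.min(MAX_SHIFT_EXPONENT))
            .min(cap);
        // The sampled sleep never exceeds the capped exponential ceiling.
        assert!(sleep <= ceiling);
        println!("attempt {attempt}: sleeping {sleep:?} (ceiling {ceiling:?})");
    }
}
```

Because the jitter spans the whole interval, two clients on the same attempt number almost never pick the same delay, which is what breaks up the synchronized bursts described above.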

Test Plan

  • CI

Release Plan

  • These changes should be backported to the latest `testnet` branch, then
    released in a validator hotfix.
Contributor Author

ndr-ds commented Feb 13, 2026

This stack of pull requests is managed by Graphite.

```rust
pub const KEY_PEM: &str = include_str!(concat!(env!("OUT_DIR"), "/private_key.pem"));

/// Maximum bit-shift exponent for `u32`: `1u32 << 31` is the largest valid shift before overflow.
const MAX_SHIFT_EXPONENT: u32 = 31;
```
Contributor

Please no hardcoded constants.

Contributor

This one seems quite sensible. I can't imagine us changing it 👀

Contributor

It's a bit more than 32 - 1.

@ndr-ds ndr-ds force-pushed the ndr-ds/replace_linear_retry_backoff_with_full_jitter_exponential_backoff branch 2 times, most recently from 2cbebb4 to 8244a6e on February 16, 2026 18:23
@ndr-ds ndr-ds requested review from Twey, afck, deuszx and ma2bd February 16, 2026 18:23
```rust
    max_backoff: std::time::Duration,
) -> std::time::Duration {
    use rand::Rng as _;
    let exponential_delay = base_delay.saturating_mul(1u32 << attempt.min(MAX_SHIFT_EXPONENT));
```
Contributor

We could use `1u32.overflowing_shl(attempt).unwrap_or(u32::MAX)`. Then we don't need the constant.

Or even `2u32.saturating_pow(attempt)`? I'm sure the compiler will realize it's a power of 2!

Contributor Author

I had Claude create some test code and generate the assembly for it, and surprisingly the compiler (at least on edition 2021) doesn't optimize `2u32.saturating_pow(attempt)` to a shift; it generates a long branching sequence.
TIL about `1u32.checked_shl(attempt).unwrap_or(u32::MAX)`, though, which does literally the same thing I was doing by hand, ha, and compiles to the same 6-instruction sequence as the original shift, so we should be good.
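To make the trade-off discussed in this thread concrete, here is a small standalone comparison (not code from the PR) of the manual clamp against `checked_shl`:

```rust
fn main() {
    // Manual clamp used in the PR: the shift amount is capped at 31.
    let manual = |attempt: u32| 1u32 << attempt.min(31);
    // Alternative raised in review: checked_shl returns None for shifts >= 32.
    let checked = |attempt: u32| 1u32.checked_shl(attempt).unwrap_or(u32::MAX);

    // Identical for every valid shift amount.
    for attempt in 0..32 {
        assert_eq!(manual(attempt), checked(attempt));
    }

    // They diverge only past the last valid shift: the manual clamp sticks at
    // 2^31 while checked_shl saturates to u32::MAX. Either way the result is
    // then tamed by saturating_mul and the 30s cap, so the backoff behaves
    // the same in practice.
    assert_eq!(manual(40), 1u32 << 31);
    assert_eq!(checked(40), u32::MAX);

    println!("both strategies agree on all valid shifts");
}
```

The divergence above 31 is harmless here precisely because the multiplier feeds into `Duration::saturating_mul` and is then clamped by the 30s cap.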

@ndr-ds ndr-ds force-pushed the ndr-ds/replace_linear_retry_backoff_with_full_jitter_exponential_backoff branch from 8244a6e to 6ed001a on February 17, 2026 13:54
@ndr-ds ndr-ds added this pull request to the merge queue Feb 17, 2026
Merged via the queue into main with commit c11027b Feb 17, 2026
35 checks passed
@ndr-ds ndr-ds deleted the ndr-ds/replace_linear_retry_backoff_with_full_jitter_exponential_backoff branch February 17, 2026 15:29
ndr-ds added a commit that referenced this pull request Feb 17, 2026
ndr-ds added a commit that referenced this pull request Feb 17, 2026