Conversation
linera-rpc/src/lib.rs
Outdated
```rust
pub const KEY_PEM: &str = include_str!(concat!(env!("OUT_DIR"), "/private_key.pem"));

/// Maximum bit-shift exponent for u32: `1u32 << 31` is the largest valid shift before overflow.
const MAX_SHIFT_EXPONENT: u32 = 31;
```
Please no hardcoded constants.
This one seems quite sensible. I can't imagine us changing 👀
Force-pushed 2cbebb4 to 8244a6e
linera-rpc/src/lib.rs
Outdated
```rust
    max_backoff: std::time::Duration,
) -> std::time::Duration {
    use rand::Rng as _;
    let exponential_delay = base_delay.saturating_mul(1u32 << attempt.min(MAX_SHIFT_EXPONENT));
```
We could use `1u32.overflowing_shl(attempt).unwrap_or(u32::MAX)`. Then we don't need the constant.
Or even `2u32.saturating_pow(attempt)`? I'm sure the compiler will realize it's a power of 2!
I had Claude create some test code and generate the assembly for it, and surprisingly the compiler (at least on edition 2021) doesn't optimize `2u32.saturating_pow(attempt)` to a shift; it generates a long branching sequence.
TIL about `1u32.checked_shl(attempt).unwrap_or(u32::MAX)`, though, which literally does the same thing I was doing by hand, ha, and compiles to the same 6-instruction sequence as the original shift, so we should be good.
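For the record, here is a tiny self-contained sketch of that saturating power-of-two multiplier (the helper name `shift_multiplier` is made up for illustration):

```rust
/// Power-of-two multiplier that saturates instead of overflowing.
/// `checked_shl` returns `None` once the shift amount reaches the bit
/// width (32 for `u32`), so large attempt counts clamp to `u32::MAX`.
fn shift_multiplier(attempt: u32) -> u32 {
    1u32.checked_shl(attempt).unwrap_or(u32::MAX)
}

fn main() {
    assert_eq!(shift_multiplier(0), 1);
    assert_eq!(shift_multiplier(10), 1 << 10);
    assert_eq!(shift_multiplier(31), 1 << 31);
    // A shift of 32 or more would overflow, so the result saturates.
    assert_eq!(shift_multiplier(32), u32::MAX);
    assert_eq!(shift_multiplier(200), u32::MAX);
    println!("ok");
}
```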
Force-pushed 8244a6e to 6ed001a
Motivation

The retry logic in gRPC client requests, gRPC subscription reconnection, and cross-chain message forwarding all used linear backoff without jitter (`delay * retry_count`). This creates a [thundering herd](https://en.wikipedia.org/wiki/Thundering_herd_problem) risk: when a validator goes down and comes back up, all clients that were retrying wake up at nearly the same time and hit the recovering validator simultaneously, potentially bringing it down again. This happens because linear backoff is deterministic: every client on the same retry count sleeps the exact same duration, so their retries synchronize into bursts.

Proposal

Replace all three retry sites with [Full Jitter exponential backoff](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/), the industry-standard approach recommended by [AWS](https://docs.aws.amazon.com/sdkref/latest/guide/feature-retry-behavior.html), [Google Cloud](https://cloud.google.com/storage/docs/retry-strategy), and the [gRPC spec](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md).

The formula is: `sleep = random(0, min(cap, base * 2^attempt))`

- Exponential growth spaces retries further apart over time
- Full randomization (jitter across the entire range, not just an additive offset) desynchronizes clients so they spread their retries evenly instead of clustering
- A fixed 30s cap prevents excessive wait times (Google Cloud uses 30s, AWS uses 20s)

Test Plan

- CI

Release Plan

- These changes should be backported to the latest `testnet` branch, then
- be released in a validator hotfix.
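The proposal above can be sketched end-to-end. This is illustrative only, not the actual patch: the real code draws jitter with `rand::Rng`, while this version pulls entropy from std's `RandomState` so the snippet has no external dependencies, and `full_jitter_delay` is a made-up name:

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher as _};
use std::time::Duration;

/// Full Jitter backoff: sleep = random(0, min(cap, base * 2^attempt)).
fn full_jitter_delay(base_delay: Duration, attempt: u32, max_backoff: Duration) -> Duration {
    // Saturating power of two: clamps to u32::MAX once `attempt` >= 32.
    let multiplier = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    // Exponential delay, capped at `max_backoff` (30s in the proposal).
    let capped = base_delay.saturating_mul(multiplier).min(max_backoff);
    let nanos = capped.as_nanos() as u64; // fits in u64 for any sane cap
    if nanos == 0 {
        return Duration::ZERO;
    }
    // Uniform-ish sample over the full [0, capped] range; a fresh
    // RandomState is randomly keyed, so its empty-input hash varies.
    let sample = RandomState::new().build_hasher().finish();
    Duration::from_nanos(sample % (nanos + 1))
}

fn main() {
    let base = Duration::from_millis(100);
    let cap = Duration::from_secs(30);
    for attempt in 0..64 {
        let d = full_jitter_delay(base, attempt, cap);
        // The sampled sleep never exceeds min(cap, base * 2^attempt).
        assert!(d <= cap);
        if attempt == 0 {
            assert!(d <= base);
        }
    }
    println!("ok");
}
```

Sampling across the entire `[0, capped]` range (rather than adding a small random offset to a deterministic delay) is what desynchronizes the clients.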