Replace linear retry backoff with full jitter exponential backoff (#5440)
The retry logic in gRPC client requests, gRPC subscription reconnection,
and cross-chain message forwarding all used
linear backoff without jitter (`delay * retry_count`). This creates a
[thundering herd](https://en.wikipedia.org/wiki/Thundering_herd_problem)
risk: when a validator goes down and comes back up, all clients that
were retrying wake up at nearly the same time and hit the recovering
validator simultaneously, potentially bringing it down again.
This happens because linear backoff is deterministic — every client on
the same retry count sleeps the exact same
duration, so their retries synchronize into bursts.
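
The determinism described above can be seen in a minimal sketch. The function names, the 1-based retry counting, and the 2000 ms delay are assumptions made for illustration; they are not taken from the codebase.

```rust
/// Linear backoff in the form quoted above: `delay * retry_count`.
/// Saturates instead of overflowing for very large inputs.
fn linear_delay_ms(delay_ms: u64, retry_count: u32) -> u64 {
    delay_ms.saturating_mul(u64::from(retry_count))
}

/// Sleep durations for retries 1..=max_retries (1-based counting is assumed).
/// The result depends only on the inputs, so every client configured with
/// the same delay computes exactly the same schedule.
fn linear_schedule_ms(delay_ms: u64, max_retries: u32) -> Vec<u64> {
    (1..=max_retries)
        .map(|n| linear_delay_ms(delay_ms, n))
        .collect()
}
```

Two clients that start retrying at the same moment with the same configured delay would sleep for identical durations on every attempt, which is what lines their retries up into bursts.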
Replace all three retry sites with [Full Jitter exponential
backoff](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/),
the industry-standard approach recommended by
[AWS](https://docs.aws.amazon.com/sdkref/latest/guide/feature-retry-behavior.html),
[Google Cloud](https://cloud.google.com/storage/docs/retry-strategy), and
the [gRPC
spec](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md).
The formula is: `sleep = random(0, min(cap, base * 2^attempt))`.
- Exponential growth spaces retries further apart over time
- Full randomization (jitter across the entire range, not just an
additive offset) desynchronizes clients so they
spread their retries evenly instead of clustering
- A fixed 30s cap prevents excessive wait times (Google Cloud uses 30s,
AWS uses 20s)
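
As a rough sketch of the formula above (not the code in this change), the delay computation could look like the following. The function names are hypothetical, the random value is passed in by the caller so the sketch needs no external crate, and the range is treated as inclusive of both ends, which the formula does not specify.

```rust
/// Upper bound of the sleep window: min(cap, base * 2^attempt).
/// Hypothetical helper; saturates rather than overflowing on large attempts.
fn backoff_ceiling_ms(base_ms: u64, cap_ms: u64, attempt: u32) -> u64 {
    // checked_shl returns None once the shift reaches the bit width (64).
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    base_ms.saturating_mul(factor).min(cap_ms)
}

/// Full jitter: pick a delay in [0, ceiling] from a caller-supplied random
/// value. Assumption: `random` is uniformly distributed over u64; the small
/// modulo bias is ignored in this sketch.
fn full_jitter_delay_ms(base_ms: u64, cap_ms: u64, attempt: u32, random: u64) -> u64 {
    let ceiling = backoff_ceiling_ms(base_ms, cap_ms, attempt);
    // ceiling + 1 is always at least 1, so the modulo cannot divide by zero.
    random % ceiling.saturating_add(1)
}
```

With an assumed base of 2000 ms and the 30000 ms cap, the ceilings for attempts 0 through 4 are 2000, 4000, 8000, 16000, and 30000 ms; each client then sleeps for a random duration between zero and that ceiling.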
- CI
- These changes should be backported to the latest `testnet` branch, then
  be released in a validator hotfix.
CLI.md (6 additions, 0 deletions)
@@ -151,6 +151,9 @@ Client implementation and command-line tool for the Linera blockchain
 * `--max-retries <MAX_RETRIES>` — Number of times to retry connecting to a validator
 
   Default value: `10`
+* `--max-backoff-ms <MAX_BACKOFF>` — Maximum backoff delay for retrying to connect to a validator
+
+  Default value: `30000`
 * `--wait-for-outgoing-messages` — Whether to wait until a quorum of validators has confirmed that all sent cross-chain messages have been delivered
 * `--allow-fast-blocks` — Whether to allow creating blocks in the fast round. Fast blocks have lower latency but must be used carefully so that there are never any conflicting fast block proposals
 * `--long-lived-services` — (EXPERIMENTAL) Whether application services can persist in some cases between queries
@@ -1203,6 +1206,9 @@ Start a Local Linera Network
 * `--cross-chain-retry-delay-ms <RETRY_DELAY_MS>` — Delay before retrying of cross-chain message
 
   Default value: `2000`
+* `--cross-chain-max-backoff-ms <MAX_BACKOFF_MS>` — Maximum backoff delay for cross-chain message retries
+
+  Default value: `30000`
 * `--cross-chain-sender-delay-ms <SENDER_DELAY_MS>` — Introduce a delay before sending every cross-chain message (e.g. for testing purpose)