
Conversation

Contributor

@deuszx deuszx commented Nov 5, 2025

Motivation

We observed that when the first request failed, the failure was broadcast to all waiting peers. This slowed things down and never tried the alternative peers that we expect to have the data.

Proposal

Detect when a request fails and, if so, try all alternative peers before returning an error. We spawn a retry operation for every alternative peer with an increasing start delay, staggered by 75ms per peer by default.
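
As a rough sketch of the mechanism (all names here, e.g. `Peer`, `query_peer`, `staggered_query`, are stand-ins and not the code in this PR; tokio's timer stands in for the internal one):

use std::time::Duration;
use futures::stream::{FuturesUnordered, StreamExt};

// Stand-ins for the real peer/response/error types.
type Peer = String;
type Response = String;
type Error = String;

const STAGGER_DELAY: Duration = Duration::from_millis(75);

// Hypothetical request function; the real code threads an `operation` closure instead.
async fn query_peer(peer: Peer) -> Result<Response, Error> {
    Ok(format!("data from {peer}"))
}

// Query every alternative peer, each one starting a bit later than the previous
// one, and return the first successful response.
async fn staggered_query(peers: Vec<Peer>) -> Result<Response, Error> {
    let mut attempts: FuturesUnordered<_> = peers
        .into_iter()
        .enumerate()
        .map(|(idx, peer)| async move {
            // Peer `idx` starts `idx * 75ms` after the first one.
            tokio::time::sleep(STAGGER_DELAY * idx as u32).await;
            query_peer(peer).await
        })
        .collect();

    let mut last_error = None;
    while let Some(result) = attempts.next().await {
        match result {
            Ok(response) => return Ok(response), // first success wins
            Err(err) => last_error = Some(err),  // remember the failure, keep waiting
        }
    }
    // The PR's version falls back to a default error when the peer list is empty.
    Err(last_error.expect("at least one peer was tried"))
}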

Test Plan

CI (a test was added for this case).

Release Plan

  • These changes should be backported to the latest testnet branch, then
    • be released in a new SDK.

Links

@deuszx deuszx force-pushed the deduplicate-handlechaininfo-query branch from 7fe9a24 to 9e3300d on November 6, 2025 13:38
@deuszx deuszx force-pushed the try-alternative-on-failure branch 2 times, most recently from 0b20882 to e028663 on November 7, 2025 10:21
@deuszx deuszx changed the base branch from deduplicate-handlechaininfo-query to testnet_conway on November 7, 2025 10:23
@deuszx deuszx force-pushed the try-alternative-on-failure branch 2 times, most recently from c86030b to acb5ad1 on November 7, 2025 12:27
@deuszx deuszx changed the title from "Try alternative peers if previous requests fail." to "When first request fails, start subsequent ones in parallel with increasing delay." on Nov 7, 2025
@deuszx deuszx force-pushed the try-alternative-on-failure branch from acb5ad1 to 7921bd7 on November 7, 2025 12:58
@deuszx deuszx marked this pull request as ready for review on November 7, 2025 12:58
@deuszx deuszx force-pushed the try-alternative-on-failure branch from 7921bd7 to 6a9b348 on November 7, 2025 13:09
@deuszx deuszx force-pushed the try-alternative-on-failure branch from 6a9b348 to 6e9cc93 on November 7, 2025 13:10
Contributor

@afck afck left a comment

No blockers from my side!

tracing::trace!(key = ?key, "executing new request");
let result = operation(peer).await;
tracing::trace!(key = ?key, peer = ?peer, "executing new request");
let result = operation(peer.clone()).await;
Contributor

Maybe for a future PR, but: would it make sense to also set a timeout here after which we retry with another peer, even if this operation hasn't returned an error yet?

Contributor Author

@deuszx deuszx Nov 7, 2025

I guess we could put all alternative peers under the staggered* function right away 🤔 a bit like we do in the test.
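
For illustration only (reusing the `staggered_query` stand-in from the sketch in the description; the real helper has a different name and signature):

// Hypothetical: skip the dedicated first attempt and hand the preferred peer plus
// all alternatives to the staggered helper right away, so a slow (not just failing)
// first peer is also covered by the stagger.
let mut all_peers = vec![preferred_peer];
all_peers.extend(alternative_peers);
let result = staggered_query(all_peers).await;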

Contributor Author

Actually, I'm gonna ticket this as there are some edge cases that I think will need more thought.

Contributor Author

Ah, it conflicts with the other timeout feature (which starts a new request if the currently in flight one is delayed). I'm gonna revert this for now and do it in a separate PR.

key = ?key,
"all staggered parallel retry attempts failed"
);
Err(last_error.unwrap_or(NodeError::UnexpectedMessage))
Contributor

(This unwrap_or error case is actually unreachable, isn't it?)

Contributor Author

Hopefully, yes. I can rewrite the code to get rid of this arm.
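
One way that arm could go away, sketched against the stand-ins from the description (a hypothetical signature change, not the actual refactor):

// Take the first peer separately, so the set of attempts is never empty and the
// final value is always a real result from some peer; no `unwrap_or` fallback needed.
async fn staggered_query(first: Peer, rest: Vec<Peer>) -> Result<Response, Error> {
    let mut last_result = query_peer(first).await;
    if last_result.is_ok() {
        return last_result;
    }
    let mut attempts: FuturesUnordered<_> = rest
        .into_iter()
        .enumerate()
        .map(|(idx, peer)| async move {
            // Alternatives still start with the staggered delay.
            tokio::time::sleep(STAGGER_DELAY * (idx as u32 + 1)).await;
            query_peer(peer).await
        })
        .collect();
    while let Some(result) = attempts.next().await {
        if result.is_ok() {
            return result;
        }
        last_result = result; // always an error from a real peer
    }
    last_result
}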


// Create futures with staggered delays
for (idx, peer) in peers.into_iter().enumerate() {
    let delay = Duration::from_millis(staggered_delay_ms * idx as u64);
Contributor

I wonder if the duration should increase quadratically: the more peers we are already trying, the more likely it is that the request is simply a very slow one, and it's not the peers' fault.
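
A minimal sketch of that variant (same `staggered_delay_ms` and `idx` as in the loop above; not part of this PR):

// Quadratic instead of linear stagger: with a 75ms base this gives
// 0ms, 75ms, 300ms, 675ms, ... so later peers wait disproportionately longer.
let delay = Duration::from_millis(staggered_delay_ms * (idx as u64).pow(2));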

// With staggered parallel: node0 fails immediately, node1 starts at 10ms (and takes 20ms),
// node2 starts at 20ms and succeeds at 25ms total
let total_time = Instant::now().duration_since(start_time).as_millis();
assert!(total_time >= staggered_delay_ms && total_time < 50,);
Contributor

(I hope that test doesn't turn flaky. But I guess it's not feasible to use the fake clock here?)

Contributor Author

I haven't tried. I can ticket it as a future task.
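
For the record, if these tests run on a tokio runtime (an assumption; the crate's own timer abstraction may differ), the paused test clock would look roughly like this:

use std::time::Duration;

// With `start_paused = true`, sleeps resolve instantly while the mocked clock
// still advances by the slept amount, so timing assertions become exact.
#[tokio::test(start_paused = true)]
async fn staggered_delay_is_deterministic() {
    let start = tokio::time::Instant::now();
    tokio::time::sleep(Duration::from_millis(75)).await;
    assert_eq!(start.elapsed(), Duration::from_millis(75));
}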


let result = futures::select! {
    _timeout = linera_base::time::timer::sleep(self.retry_delay).fuse() => {
        tracing::trace!(key = ?key, "retry delay elapsed, proceeding with request");
        Err(NodeError::WorkerError { error: "timeout".to_string() }) // Placeholder error to trigger retries
Contributor

But we don't want to cancel the ongoing operation in that case? We just want to add it to the unordered futures, basically?

Contributor Author

@deuszx deuszx Nov 7, 2025

Yes, maybe you're right. I didn't add it to the method because it changes the other rule: the peer node is the first one we request data from. Maintaining both would require rewriting quite a lot of code.

This is reverted now anyway.
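
For reference, the non-cancelling variant discussed above could look roughly like this (a sketch against the stand-ins from the description; this code path was reverted, so none of it reflects the actual implementation):

use std::time::Duration;
use futures::{FutureExt, StreamExt, stream::FuturesUnordered};

// Hypothetical: when the retry delay elapses, start a backup request without
// cancelling the primary one; both then race inside the same set of futures.
async fn race_with_backup(
    primary: Peer,
    backup: Peer,
    retry_delay: Duration,
) -> Result<Response, Error> {
    let mut attempts = FuturesUnordered::new();
    attempts.push(query_peer(primary).boxed());

    let mut delay = Box::pin(tokio::time::sleep(retry_delay)).fuse();
    futures::select! {
        // The primary is slow but still in flight: add a backup instead of replacing it.
        _ = delay => attempts.push(query_peer(backup).boxed()),
        // The primary finished before the delay elapsed.
        result = attempts.select_next_some() => return result,
    }

    // Whichever of the (now two) in-flight requests finishes first wins; an error
    // simply waits for the remaining one.
    let mut last_result = Err(Error::from("no attempt completed"));
    while let Some(result) = attempts.next().await {
        if result.is_ok() {
            return result;
        }
        last_result = result;
    }
    last_result
}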

@deuszx deuszx force-pushed the try-alternative-on-failure branch from d0d2330 to 36ca3cb on November 7, 2025 15:04
@deuszx deuszx force-pushed the try-alternative-on-failure branch from 36ca3cb to b08186b on November 7, 2025 15:36