
Conversation

Contributor

@deuszx deuszx commented Nov 5, 2025

Motivation

We observed that when the first request failed, the failure was broadcast to all waiting peers. This slowed things down and never tried the alternative peers that we expect to have the data.

Proposal

Detect when a request fails and, if so, try all alternative peers before returning an error. We spawn a retry operation for every alternative peer with an increasing start delay, staggered by 75ms per peer by default.
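
As a rough sketch of the mechanism (all names here, e.g. `Peer`, `query_peer`, `staggered_query`, are stand-ins and not the code in this PR; tokio's timer stands in for the internal one):

use std::time::Duration;
use futures::stream::{FuturesUnordered, StreamExt};

// Stand-ins for the real peer/response/error types.
type Peer = String;
type Response = String;
type Error = String;

const STAGGER_DELAY: Duration = Duration::from_millis(75);

// Hypothetical request function; the real code threads an `operation` closure instead.
async fn query_peer(peer: Peer) -> Result<Response, Error> {
    Ok(format!("data from {peer}"))
}

// Query every alternative peer, each one starting a bit later than the previous
// one, and return the first successful response.
async fn staggered_query(peers: Vec<Peer>) -> Result<Response, Error> {
    let mut attempts: FuturesUnordered<_> = peers
        .into_iter()
        .enumerate()
        .map(|(idx, peer)| async move {
            // Peer `idx` starts `idx * 75ms` after the first one.
            tokio::time::sleep(STAGGER_DELAY * idx as u32).await;
            query_peer(peer).await
        })
        .collect();

    let mut last_error = None;
    while let Some(result) = attempts.next().await {
        match result {
            Ok(response) => return Ok(response), // first success wins
            Err(err) => last_error = Some(err),  // remember the failure, keep waiting
        }
    }
    // The PR's version falls back to a default error when the peer list is empty.
    Err(last_error.expect("at least one peer was tried"))
}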

Test Plan

CI (a test was added for this case).

Release Plan

  • These changes should be backported to the latest testnet branch, then
    • be released in a new SDK.

Links

@deuszx deuszx force-pushed the deduplicate-handlechaininfo-query branch from 7fe9a24 to 9e3300d on November 6, 2025 13:38
@deuszx deuszx force-pushed the try-alternative-on-failure branch 2 times, most recently from 0b20882 to e028663 on November 7, 2025 10:21
@deuszx deuszx changed the base branch from deduplicate-handlechaininfo-query to testnet_conway on November 7, 2025 10:23
@deuszx deuszx force-pushed the try-alternative-on-failure branch 2 times, most recently from c86030b to acb5ad1 on November 7, 2025 12:27
@deuszx deuszx changed the title from "Try alternative peers if previous requests fail." to "When first request fails, start subsequent ones in parallel with increasing delay." on Nov 7, 2025
@deuszx deuszx force-pushed the try-alternative-on-failure branch from acb5ad1 to 7921bd7 on November 7, 2025 12:58
@deuszx deuszx marked this pull request as ready for review on November 7, 2025 12:58
@deuszx deuszx force-pushed the try-alternative-on-failure branch from 7921bd7 to 6a9b348 on November 7, 2025 13:09
@deuszx deuszx force-pushed the try-alternative-on-failure branch from 6a9b348 to 6e9cc93 on November 7, 2025 13:10
Contributor

@afck afck left a comment

No blockers from my side!

tracing::trace!(key = ?key, "executing new request");
let result = operation(peer).await;
tracing::trace!(key = ?key, peer = ?peer, "executing new request");
let result = operation(peer.clone()).await;
Contributor

Maybe for a future PR, but: would it make sense to also set a timeout here after which we retry with another peer, even if this operation hasn't returned an error yet?

Contributor Author

@deuszx deuszx Nov 7, 2025

I guess we could put all alternative peers under the staggered* function right away 🤔 a bit like we do in the test.
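
For illustration only (reusing the `staggered_query` stand-in from the sketch in the description; the real helper has a different name and signature):

// Hypothetical: skip the dedicated first attempt and hand the preferred peer plus
// all alternatives to the staggered helper right away, so a slow (not just failing)
// first peer is also covered by the stagger.
let mut all_peers = vec![preferred_peer];
all_peers.extend(alternative_peers);
let result = staggered_query(all_peers).await;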

Contributor Author

Actually, I'm gonna ticket this as there are some edge cases that I think will need more thought.

Contributor Author

Ah, it conflicts with the other timeout feature (which starts a new request if the currently in flight one is delayed). I'm gonna revert this for now and do it in a separate PR.

key = ?key,
"all staggered parallel retry attempts failed"
);
Err(last_error.unwrap_or(NodeError::UnexpectedMessage))
Contributor

(This unwrap_or error case is actually unreachable, isn't it?)

Contributor Author

Hopefully, yes. I can rewrite the code to get rid of this arm.
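
One way that arm could go away, sketched against the stand-ins from the description (a hypothetical signature change, not the actual refactor):

// Take the first peer separately, so the set of attempts is never empty and the
// final value is always a real result from some peer; no `unwrap_or` fallback needed.
async fn staggered_query(first: Peer, rest: Vec<Peer>) -> Result<Response, Error> {
    let mut last_result = query_peer(first).await;
    if last_result.is_ok() {
        return last_result;
    }
    let mut attempts: FuturesUnordered<_> = rest
        .into_iter()
        .enumerate()
        .map(|(idx, peer)| async move {
            // Alternatives still start with the staggered delay.
            tokio::time::sleep(STAGGER_DELAY * (idx as u32 + 1)).await;
            query_peer(peer).await
        })
        .collect();
    while let Some(result) = attempts.next().await {
        if result.is_ok() {
            return result;
        }
        last_result = result; // always an error from a real peer
    }
    last_result
}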


// Create futures with staggered delays
for (idx, peer) in peers.into_iter().enumerate() {
    let delay = Duration::from_millis(staggered_delay_ms * idx as u64);
Contributor

I wonder if the duration should increase quadratically: the more peers we are already trying, the more likely it is that the request is simply a very slow one, and it's not the peers' fault.
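
A minimal sketch of that variant (same `staggered_delay_ms` and `idx` as in the loop above; not part of this PR):

// Quadratic instead of linear stagger: with a 75ms base this gives
// 0ms, 75ms, 300ms, 675ms, ... so later peers wait disproportionately longer.
let delay = Duration::from_millis(staggered_delay_ms * (idx as u64).pow(2));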

// With staggered parallel: node0 fails immediately, node1 starts at 10ms (and takes 20ms),
// node2 starts at 20ms and succeeds at 25ms total
let total_time = Instant::now().duration_since(start_time).as_millis();
assert!(total_time >= staggered_delay_ms && total_time < 50,);
Contributor

(I hope that test doesn't turn flaky. But I guess it's not feasible to use the fake clock here?)

Contributor Author

I haven't tried. I can ticket it as a future task.
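
For the record, if these tests run on a tokio runtime (an assumption; the crate's own timer abstraction may differ), the paused test clock would look roughly like this:

use std::time::Duration;

// With `start_paused = true`, sleeps resolve instantly while the mocked clock
// still advances by the slept amount, so timing assertions become exact.
#[tokio::test(start_paused = true)]
async fn staggered_delay_is_deterministic() {
    let start = tokio::time::Instant::now();
    tokio::time::sleep(Duration::from_millis(75)).await;
    assert_eq!(start.elapsed(), Duration::from_millis(75));
}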


let result = futures::select! {
    _timeout = linera_base::time::timer::sleep(self.retry_delay).fuse() => {
        tracing::trace!(key = ?key, "retry delay elapsed, proceeding with request");
        Err(NodeError::WorkerError { error: "timeout".to_string() }) // Placeholder error to trigger retries
Contributor

But we don't want to cancel the ongoing operation in that case? We just want to add it to the unordered futures, basically?

Contributor Author

@deuszx deuszx Nov 7, 2025

Yes, maybe you're right. I didn't add it to the method because it changes the other rule: the peer node is the first one we request data from. Maintaining both would require rewriting quite a lot of code.

This is reverted now anyway.
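
For reference, the non-cancelling variant discussed above could look roughly like this (a sketch against the stand-ins from the description; this code path was reverted, so none of it reflects the actual implementation):

use std::time::Duration;
use futures::{FutureExt, StreamExt, stream::FuturesUnordered};

// Hypothetical: when the retry delay elapses, start a backup request without
// cancelling the primary one; both then race inside the same set of futures.
async fn race_with_backup(
    primary: Peer,
    backup: Peer,
    retry_delay: Duration,
) -> Result<Response, Error> {
    let mut attempts = FuturesUnordered::new();
    attempts.push(query_peer(primary).boxed());

    let mut delay = Box::pin(tokio::time::sleep(retry_delay)).fuse();
    futures::select! {
        // The primary is slow but still in flight: add a backup instead of replacing it.
        _ = delay => attempts.push(query_peer(backup).boxed()),
        // The primary finished before the delay elapsed.
        result = attempts.select_next_some() => return result,
    }

    // Whichever of the (now two) in-flight requests finishes first wins; an error
    // simply waits for the remaining one.
    let mut last_result = Err(Error::from("no attempt completed"));
    while let Some(result) = attempts.next().await {
        if result.is_ok() {
            return result;
        }
        last_result = result;
    }
    last_result
}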

@deuszx deuszx force-pushed the try-alternative-on-failure branch from d0d2330 to 36ca3cb on November 7, 2025 15:04
@deuszx deuszx force-pushed the try-alternative-on-failure branch from 36ca3cb to b08186b on November 7, 2025 15:36