When first request fails, start subsequent ones in parallel with increasing delay. #4913
base: testnet_conway
Conversation
afck left a comment
No blockers from my side!
tracing::trace!(key = ?key, "executing new request");
let result = operation(peer).await;
tracing::trace!(key = ?key, peer = ?peer, "executing new request");
let result = operation(peer.clone()).await;
Maybe for a future PR, but: would it make sense to also set a timeout here after which we retry with another peer, even if this operation hasn't returned an error yet?
I guess we could put all alternative peers under the staggered* function right away 🤔 a bit like we do in the test.
Actually, I'm gonna ticket this as there are some edge cases that I think will need more thought.
Anyway, done here: Set timeout for the first request
Ah, it conflicts with the other timeout feature (which starts a new request if the currently in flight one is delayed). I'm gonna revert this for now and do it in a separate PR.
    key = ?key,
    "all staggered parallel retry attempts failed"
);
Err(last_error.unwrap_or(NodeError::UnexpectedMessage))
(This unwrap_or error case is actually unreachable, isn't it?)
hopefully, yes. I can rewrite the code to get rid of this arm.
// Create futures with staggered delays
for (idx, peer) in peers.into_iter().enumerate() {
    let delay = Duration::from_millis(staggered_delay_ms * idx as u64);
I wonder if the duration should increase quadratically: The more peers we are already trying the more likely it is that the request is simply a very slow one, and it's not the peers' fault.
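For illustration only, a minimal sketch of that quadratic variant, reusing the `staggered_delay_ms` and `idx` names from the diff quoted above; everything else is assumed, not code from this PR:

```rust
use std::time::Duration;

// Hypothetical quadratic back-off: later peers wait disproportionately longer,
// on the assumption that a request still unanswered after trying several peers
// is probably just slow rather than failing.
for (idx, peer) in peers.into_iter().enumerate() {
    let delay = Duration::from_millis(staggered_delay_ms * (idx as u64).pow(2));
    // ... schedule the request to `peer` after `delay`, as in the original code.
}
```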
// With staggered parallel: node0 fails immediately, node1 starts at 10ms (and takes 20ms),
// node2 starts at 20ms and succeeds at 25ms total
let total_time = Instant::now().duration_since(start_time).as_millis();
assert!(total_time >= staggered_delay_ms && total_time < 50,);
(I hope that test doesn't turn flaky. But I guess it's not feasible to use the fake clock here?)
I haven't tried. I can ticket it as a future task.
let result = futures::select! {
    _timeout = linera_base::time::timer::sleep(self.retry_delay).fuse() => {
        tracing::trace!(key = ?key, "retry delay elapsed, proceeding with request");
        Err(NodeError::WorkerError { error: "timeout".to_string() }) // Placeholder error to trigger retries
But we don't want to cancel the ongoing operation in that case? We just want to add it to the unordered futures, basically?
Yes, maybe you're right. I didn't add it to the method because it changes the other rule: the peer node is the first one we request data from. To maintain both, I'd have to rewrite the code quite a lot.
This is reverted now anyway.
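For reference, a rough sketch of the "add it to the unordered futures" idea. The function name `race_without_cancelling` and the generic types are hypothetical; only `FuturesUnordered`, `futures::select!`, and the `linera_base::time::timer::sleep(...).fuse()` call mirror the diffs quoted above. This is not the code in this PR:

```rust
use std::time::Duration;

use futures::{stream::FuturesUnordered, FutureExt, StreamExt};

// Hypothetical helper: race the in-flight request against the stagger delay, and
// when the delay elapses first, push the next peer's request into the same
// `FuturesUnordered` instead of dropping the one already running.
async fn race_without_cancelling<P, T, E, F, Fut>(
    peers: Vec<P>,
    stagger: Duration,
    mut operation: F,
) -> Option<Result<T, E>>
where
    F: FnMut(P) -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut peers = peers.into_iter();
    let mut in_flight = FuturesUnordered::new();
    in_flight.push(operation(peers.next()?));
    loop {
        futures::select! {
            result = in_flight.select_next_some() => match result {
                Ok(value) => return Some(Ok(value)),
                // Keep waiting while other requests are still running or queued.
                Err(_) if !in_flight.is_empty() || peers.len() > 0 => {}
                Err(error) => return Some(Err(error)),
            },
            _ = linera_base::time::timer::sleep(stagger).fuse() => {
                // The delay elapsed without any result: start the next peer's
                // request alongside the ones already in flight.
                if let Some(peer) = peers.next() {
                    in_flight.push(operation(peer));
                }
            }
        }
    }
}
```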
Motivation
We observed that when the first request failed, the failure was broadcast to all waiting peers. This slowed down the process and didn't make use of alternative peers that we expect to have the data.
Proposal
Detect if a request fails and, if so, try all alternative peers before erroring. We spawn a retry operation for every alternative peer with an ever-increasing delay, in steps of 75ms by default. A sketch of the idea follows below.
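For concreteness, a sketch of the approach under assumed names (`staggered_parallel_request`, the generic peer type `P`). `NodeError` and the sleep call are taken from the diffs quoted above; the rest is illustrative and not necessarily the exact code in this PR:

```rust
use std::time::Duration;

use futures::{stream::FuturesUnordered, StreamExt};

// Sketch only: each alternative peer gets a future that first sleeps for
// `delay * index` and then issues the request; the first success wins, and the
// last error is returned only if every peer fails.
async fn staggered_parallel_request<P, T, F, Fut>(
    peers: Vec<P>,
    delay: Duration,
    operation: F,
) -> Result<T, NodeError>
where
    F: Fn(P) -> Fut + Clone,
    Fut: std::future::Future<Output = Result<T, NodeError>>,
{
    let mut requests: FuturesUnordered<_> = peers
        .into_iter()
        .enumerate()
        .map(|(idx, peer)| {
            let operation = operation.clone();
            async move {
                // The first peer starts immediately; each later one is staggered.
                linera_base::time::timer::sleep(delay * idx as u32).await;
                operation(peer).await
            }
        })
        .collect();
    let mut last_error = None;
    while let Some(result) = requests.next().await {
        match result {
            // Returning here drops (cancels) the requests that are still pending.
            Ok(value) => return Ok(value),
            Err(error) => last_error = Some(error),
        }
    }
    // Reachable only if `peers` was empty or every request failed.
    Err(last_error.unwrap_or(NodeError::UnexpectedMessage))
}
```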
Test Plan
CI (a test was added for this case).
Release Plan
testnet branch, then
Links