feat(fortuna): better retry mechanism #2780
Conversation
    .get_request(event.provider_address, event.sequence_number)
    .await;

tracing::error!("Failed to process event: {:?}. Request: {:?}", e, req);
I moved this inside because it was creating a bunch of false alarms.
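For context, a minimal sketch of the shape of that change; the names `process_event` and `chain_state` are placeholders for the surrounding keeper code (which isn't shown in this hunk), not the actual identifiers in the PR. The point is that the request lookup and the error log only run when processing actually failed:

```rust
// Sketch only; `process_event` and `chain_state` are placeholder names.
if let Err(e) = process_event(&event).await {
    // Fetch the request only on failure, so the lookup and the error log
    // no longer fire (and raise alarms) on the happy path.
    let req = chain_state
        .get_request(event.provider_address, event.sequence_number)
        .await;
    tracing::error!("Failed to process event: {:?}. Request: {:?}", e, req);
}
```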
please do fix the spacing on the retries but lgtm aside from that
apps/fortuna/src/eth_utils/utils.rs
Outdated
    )
    .await;
result.map_err(|e| error_mapper(num_retries, e))
(and then you don't need to pass error_mapper)
I didn't understand this comment
apps/fortuna/src/eth_utils/utils.rs
Outdated
    gas_limit: U256,
    escalation_policy: EscalationPolicy,
    error_mapper: impl Fn(u64, backoff::Error<SubmitTxError<T>>) -> backoff::Error<SubmitTxError<T>>,
) -> Result<SubmitTxResult> {
can you leave a comment that this lets you customize the backoff behavior based on the error type? It's not obvious what you get from this at the moment
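Something along those lines could read roughly like this; a hedged sketch of a possible doc comment, not the one that actually landed in the PR:

```rust
/// `error_mapper` lets the caller customize the backoff behavior based on the
/// error type and the retry count: it can turn an error into a transient one
/// with a custom `retry_after`, or mark it permanent to stop retrying.
```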
if 1 < num_retries && num_retries < 5 {
    return backoff::Error::Transient {
        err,
        retry_after: Some(Duration::from_secs(60)),
I think the spacing here needs to be a bit more granular: retry the first time after 5 seconds, then 10 seconds, then 60 seconds.
These errors happen pretty frequently in the first 1-2 seconds because of RPC async issues. The current logic will significantly degrade the UX whenever this happens, because the callback will take 60 seconds now.
This already kicks in on the 3rd attempt, but I will also increase the delay on the first 2 attempts.
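For illustration, a graduated error_mapper along the lines discussed here could look roughly like this. The thresholds (5s, 10s, then 60s) and the helper name come from the suggestion above, not from the merged code, and `E` stands in for `SubmitTxError<T>`:

```rust
use std::time::Duration;

// Illustrative sketch: map the retry count to an increasing delay, leaving
// permanent errors untouched so they still abort the backoff loop.
fn graduated_delay<E>(num_retries: u64, err: backoff::Error<E>) -> backoff::Error<E> {
    let retry_after = match num_retries {
        0 => Duration::from_secs(5),
        1 => Duration::from_secs(10),
        _ => Duration::from_secs(60),
    };
    match err {
        backoff::Error::Transient { err, .. } => backoff::Error::Transient {
            err,
            retry_after: Some(retry_after),
        },
        permanent => permanent,
    }
}
```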
Summary
If we deploy multiple instances of Fortuna, keepers will compete to fulfill requests and only one of them will succeed in making the callback. The other instances will keep retrying for 5 minutes, which by default can take up to 13 retries. Since this will happen for every request, RPC usage will increase substantially, which is not acceptable.
This PR fixes this by adjusting the retry_interval and the number of retries. Another nice side-effect here is that we get better error messages for the explorer.
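As a rough sketch of that tuning (the values here are illustrative, not the ones merged in this PR), the backoff policy can be tightened so each request is retried only a handful of times instead of ~13 attempts over 5 minutes:

```rust
use std::time::Duration;
use backoff::ExponentialBackoff;

// Illustrative values only: a larger initial interval and a capped total
// elapsed time mean far fewer attempts per request than the crate defaults,
// which is what keeps competing keeper instances from hammering the RPC.
fn keeper_backoff() -> ExponentialBackoff {
    ExponentialBackoff {
        initial_interval: Duration::from_secs(5),
        max_interval: Duration::from_secs(60),
        max_elapsed_time: Some(Duration::from_secs(120)),
        ..Default::default()
    }
}
```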
Rationale
To avoid excessive RPC usage
How has this been tested?
Ran two instances of Fortuna locally, created 10 requests on monad-testnet and verified the behavior.