
Conversation

@m30m (Contributor) commented Jun 12, 2025

Summary

If we deploy multiple instances of Fortuna, keepers will compete to fulfill requests, and only one of them will succeed in making the callback. The other instances will keep retrying for 5 minutes, which by default can take up to 13 retries. Since this happens for every request, RPC usage will increase substantially, which is not acceptable.

This PR fixes this by:

  1. Exposing the underlying errors in an inspectable form instead of wrapping everything in anyhow.
  2. Exposing an error mapper that can customize the error returned before retrying. Using this mechanism, we customize the retry_interval and the number of retries (see the sketch below).

Another nice side effect is that we get better error messages for the explorer.
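
As an illustration of the mechanism, here is a minimal sketch of such an error mapper, assuming the backoff crate's Error type; the SubmitTxError variants below are hypothetical stand-ins for the real SubmitTxError<T>, and the thresholds are examples only, not the values in this PR:

use std::time::Duration;

// Hypothetical error type, standing in for the crate's SubmitTxError<T>.
#[derive(Debug)]
enum SubmitTxError {
    Rpc(String),
    Other(String),
}

// Sketch of an error mapper: it receives the retry count and the error that
// backoff would otherwise retry, and may rewrite it to change the retry
// schedule or to stop retrying altogether.
fn error_mapper(
    num_retries: u64,
    err: backoff::Error<SubmitTxError>,
) -> backoff::Error<SubmitTxError> {
    match err {
        // Cap the number of retries: after a few attempts another keeper has
        // almost certainly fulfilled the request, so stop instead of burning
        // RPC calls for the full 5 minutes.
        backoff::Error::Transient { err, .. } if num_retries >= 5 => {
            backoff::Error::Permanent(err)
        }
        // Otherwise keep retrying, but space the attempts further apart.
        backoff::Error::Transient { err, .. } => backoff::Error::Transient {
            err,
            retry_after: Some(Duration::from_secs(60)),
        },
        permanent => permanent,
    }
}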

Rationale

To avoid excessive RPC usage.

How has this been tested?

  • Current tests cover my changes
  • Added new tests
  • Manually tested the code

Ran two instances of Fortuna locally, created 10 requests on monad-testnet, and verified the behavior.


.get_request(event.provider_address, event.sequence_number)
.await;

tracing::error!("Failed to process event: {:?}. Request: {:?}", e, req);

Contributor Author (m30m) commented:

I moved this inside because it was creating a bunch of false alarms.

@jayantk (Contributor) left a comment:

Please do fix the spacing on the retries, but LGTM aside from that.

)
.await;
result.map_err(|e| error_mapper(num_retries, e))

Contributor commented:

(and then you don't need to pass error_mapper)

Contributor Author (m30m) commented:

I didn't understand this comment.

gas_limit: U256,
escalation_policy: EscalationPolicy,
error_mapper: impl Fn(u64, backoff::Error<SubmitTxError<T>>) -> backoff::Error<SubmitTxError<T>>,
) -> Result<SubmitTxResult> {

Contributor commented:

Can you leave a comment that this lets you customize the backoff behavior based on the error type? It's not obvious at the moment what you get from this.
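
For example, a doc comment along these lines could go on the parameter (the wording is a suggestion, not the committed text):

/// Maps the error returned by a submission attempt before backoff decides
/// whether to retry. This lets callers customize the backoff behavior based
/// on the error type and the number of retries so far, e.g. lengthening
/// retry_after or turning a transient error into a permanent one to stop
/// retrying early.
error_mapper: impl Fn(u64, backoff::Error<SubmitTxError<T>>) -> backoff::Error<SubmitTxError<T>>,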

if 1 < num_retries && num_retries < 5 {
    return backoff::Error::Transient {
        err,
        retry_after: Some(Duration::from_secs(60)),

Contributor commented:

I think the spacing here needs to be a bit more granular: retry the first time after 5 seconds, then 10 seconds, then 60 seconds.

These errors happen pretty frequently in the first 1-2 seconds because of RPC async issues. The current logic will significantly degrade the UX whenever this happens, because the callback will now take 60 seconds.

Contributor Author (m30m) commented:

This already kicks in on the 3rd attempt, but I will also increase the delay on the first 2 attempts.
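
A sketch of the more granular spacing discussed above, assuming the delay is derived from the attempt count and then plugged into retry_after; the thresholds are taken from this thread, not from the final code:

use std::time::Duration;

// Graduated retry spacing: 5s after the first failure, 10s after the second,
// then 60s for later attempts.
fn retry_delay(num_retries: u64) -> Duration {
    match num_retries {
        0 => Duration::from_secs(5),
        1 => Duration::from_secs(10),
        _ => Duration::from_secs(60),
    }
}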
