Conversation

@tnull
Contributor

@tnull tnull commented Aug 19, 2025

Based on #38.

We enable reqwest client-level timeouts:

While the `RetryPolicy` has a `MaxTotalDelayRetryPolicy`, the retry
`loop` only checks this configured delay once the operation future
actually returns a value. However, without client-side timeouts, the
operation isn't actually guaranteed to ever return anything (even an
error, IIUC).

So here, we enable some coarse client-side default timeouts to ensure
the polled futures eventually return either the response *or* an error
we can handle via our retry logic.

~~Additionally, we here rip out our ~broken retry logic and replace it by utilizing reqwest's retry logic that shipped in the recent v0.12.23 release.~~
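
For illustration, a minimal sketch of what these client-level defaults look like on the `reqwest` builder (the 10s value for `DEFAULT_TIMEOUT` is assumed here purely for illustration; the actual constant lives in src/client.rs):

```rust
use std::time::Duration;

// Assumed value, for illustration only.
const DEFAULT_TIMEOUT: Duration = Duration::from_secs(10);

fn build_client() -> reqwest::Result<reqwest::Client> {
	reqwest::Client::builder()
		// Bounds the entire request, from connecting until the response
		// body has been read.
		.timeout(DEFAULT_TIMEOUT)
		// Bounds establishing the connection itself.
		.connect_timeout(DEFAULT_TIMEOUT)
		// Bounds individual socket reads, which also helps detect dropped
		// connections.
		.read_timeout(DEFAULT_TIMEOUT)
		.build()
}
```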

@tnull tnull requested review from jkczyz and tankyleo August 19, 2025 09:45
@ldk-reviews-bot

ldk-reviews-bot commented Aug 19, 2025

👋 Thanks for assigning @tankyleo as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@tnull tnull changed the title 2025 08 enable client side delays Enable client-side delays Aug 19, 2025
@tnull tnull changed the title Enable client-side delays Enable client-side timeouts Aug 19, 2025
src/client.rs Outdated
Comment on lines 36 to 38
.timeout(DEFAULT_TIMEOUT)
.connect_timeout(DEFAULT_TIMEOUT)
.read_timeout(DEFAULT_TIMEOUT)
Contributor


Thank you! Seems like we could do with just the single global timeout here, no need for connect and read?

But we can leave it as is and potentially tweak the inner timeouts later.

Contributor Author


Right, no strong opinion here.

@tankyleo
Contributor

One aspect that might be debatable here is whether we should drop MaxTotalDelayRetryPolicy given it would interact with the client-side default delay. Hence also pinging @jkczyz who reviewed the original retry PR.

For reference, the original PR is #20. On first impression, I'd be in favor of the drop myself.

@jkczyz

jkczyz commented Aug 20, 2025

One aspect that might be debatable here is whether we should drop MaxTotalDelayRetryPolicy given it would interact with the client-side default delay. Hence also pinging @jkczyz who reviewed the original retry PR.

For reference, the original PR is #20. On first impression, I'd be in favor of the drop myself.

Do the added timeouts apply to a single operation? If it is never exceeded, wouldn't we still want MaxTotalDelayRetryPolicy to allow limiting retries to a maximum amount of time?

What is meant by client-side default delay?

@tnull
Contributor Author

tnull commented Aug 20, 2025

Do the added timeouts apply to a single operation?

Yes, they apply to a single read, but also to connecting / detecting dropped connections, AFAIU.

If it is never exceeded, wouldn't we still want MaxTotalDelayRetryPolicy to allow limiting retries to a maximum amount of time?

Yes, it could be useful, but of course it's somewhat redundant if we set a client-side timeout and limit the number of retries. It could therefore be a bit confusing if somebody configures the MaxTotalDelayRetryPolicy, but the other total delay (i.e., number of retries times timeout) still applies if it's lesser.

What is meant by client-side default delay?

Ah, sorry, that was a typo I only corrected in the PR title: should have said default timeout, not delay.

@jkczyz

jkczyz commented Aug 20, 2025

Yes, it could be useful, but of course it's somewhat redundant if we set a client-side timeout and limit the number of retries.

Do you mean MaxTotalDelayRetryPolicy<MaxAttemptsRetryPolicy<R>>?

It could therefore be a bit confusing if somebody configures the MaxTotalDelayRetryPolicy, but the other total delay (i.e., number of retries times timeout) still applies if it's lesser.

Isn't this already the case when configured as I mentioned above? Number of attempts takes priority over total delay given the way MaxTotalDelayRetryPolicy is written.

Maybe I'm confused about what is lesser in that example.
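
To make that precedence concrete, here is a simplified, hypothetical sketch of such a nested policy (the crate's actual `next_delay` signature differs — notably it takes the returned error, as discussed further down):

```rust
use std::time::{Duration, Instant};

// Simplified stand-ins for the crate's policy types, for illustration only.
trait RetryPolicy {
	// Returns the delay before the next attempt, or `None` to give up.
	fn next_delay(&self, attempts_made: usize, started_at: Instant) -> Option<Duration>;
}

struct MaxAttemptsRetryPolicy {
	max_attempts: usize,
	base_delay: Duration,
}

impl RetryPolicy for MaxAttemptsRetryPolicy {
	fn next_delay(&self, attempts_made: usize, _: Instant) -> Option<Duration> {
		(attempts_made < self.max_attempts).then_some(self.base_delay)
	}
}

// Wraps an inner policy and additionally gives up once the total elapsed
// time exceeds `max_total_delay`: whichever limit trips first wins, which
// is why the attempt count can take priority over the total delay.
struct MaxTotalDelayRetryPolicy<P: RetryPolicy> {
	inner: P,
	max_total_delay: Duration,
}

impl<P: RetryPolicy> RetryPolicy for MaxTotalDelayRetryPolicy<P> {
	fn next_delay(&self, attempts_made: usize, started_at: Instant) -> Option<Duration> {
		if started_at.elapsed() >= self.max_total_delay {
			return None;
		}
		self.inner.next_delay(attempts_made, started_at)
	}
}
```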

@tnull
Contributor Author

tnull commented Aug 21, 2025

Do you mean MaxTotalDelayRetryPolicy<MaxAttemptsRetryPolicy<R>>?

Yes, if each client call is also limited by a timeout, then we'd have either the per-call timeout times the MaxAttemptsRetryPolicy attempt count or the MaxTotalDelayRetryPolicy being the limiting factor.

Maybe I'm confused about what is lesser in that example.

Say you configure MaxTotalDelayRetryPolicy<MaxAttemptsRetryPolicy<R>> with 5 retries and a total delay of 100 seconds (for the sake of this example). Then you'd expect the client to return either in case of success, or after it tried 5 times, or after 100s, whichever comes first. Now with the client-side timeouts, we'd also have each retry time out after 10 seconds, so it might already be done after 50s.

Or, maybe even a bit more confusing would be if the user configured a MaxTotalDelayRetryPolicy of less than 10s, say 5s. They would expect a client call to definitely return after those 5s. But, if we have a client-side timeout of 10s (or none, before this PR), even the first attempt could take way longer than the configured total delay (since we don't use a select but rather a loop, we always await the first/current operation to return).
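
In other words, the effective bound becomes min(max total delay, attempts × per-attempt timeout). To illustrate the loop-vs-select point: racing each attempt against the remaining budget, e.g. via `tokio::time::timeout`, would bound the wait even if the operation future never resolves. A hypothetical sketch (`operation` is a stand-in for a single client call):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical: unlike plainly `.await`ing the operation inside the retry
// loop, this returns `None` once `total_delay` elapses, even if the
// operation future never resolves on its own.
async fn call_with_total_delay<T, E>(
	total_delay: Duration,
	operation: impl std::future::Future<Output = Result<T, E>>,
) -> Option<Result<T, E>> {
	timeout(total_delay, operation).await.ok()
}
```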

@ldk-reviews-bot

🔔 1st Reminder

Hey @jkczyz! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

@jkczyz

jkczyz commented Aug 21, 2025

Say you configure MaxTotalDelayRetryPolicy<MaxAttemptsRetryPolicy<R>> with 5 retries and a total delay of 100 seconds (for the sake of this example). Then you'd expect the client to return either in case of success, or after it tried 5 times, or after 100s, whichever comes first. Now with the client-side timeouts, we'd also have each retry time out after 10 seconds, so it might already be done after 50s.

Isn't that expected? "done after 50s" is really "done after 5 attempts".

Or, maybe even a bit more confusing would be if the user configured a MaxTotalDelayRetryPolicy of less than 10s, say 5s. They would expect a client call to definitely return after those 5s. But, if we have a client-side timeout of 10s (or none, before this PR), even the first attempt could take way longer than the configured total delay (since we don't use a select but rather a loop, we always await the first/current operation to return).

Yeah, though isn't that a good argument to use select? Or does the current design not allow that, given the policy timeout is built into the type rather than the call site being aware of it?

@tnull
Contributor Author

tnull commented Aug 21, 2025

Yeah, though isn't that a good argument to use select? Or does the current design not allow that, given the policy timeout is built into the type rather than the call site being aware of it?

True, seems like we should? And given that we already use tokio with the time feature, it should be straightforward. I think I'll add a commit to this PR.

@tnull
Contributor Author

tnull commented Aug 22, 2025

Yeah, though isn't that a good argument to use select? Or does the current design not allow that, given the policy timeout is built into the type rather than the call site being aware of it?

True, seems like we should? And given that we already use tokio with the time feature, it should be straightforward. I think I'll add a commit to this PR.

Argh, after looking into it for a bit I have to eat my words: it's actually not trivial, as currently RetryPolicy::next_delay requires us to supply the returned error, i.e., we can only calculate the next delay based on the error type (as we use it in FilteredRetryPolicy).

And more generally, with a generic error type, we don't know what error we'd return in case the timeout happens before the operation future resolves.

@jkczyz

jkczyz commented Aug 22, 2025

And more generally, with a generic error type, we don't know what error we'd return in case the timeout happens before the operation future resolves.

Would it help to define an enum parameterized by the error E, where one variant is for a timeout and the other wraps E?
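
A minimal sketch of that suggestion (all names here are hypothetical):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical wrapper error: one variant for the client-side timeout
// firing, one wrapping the operation's own error type `E`.
#[derive(Debug)]
enum AttemptError<E> {
	Timeout,
	Op(E),
}

// With this, a timed attempt can always hand *some* error to
// `RetryPolicy::next_delay`, even when the operation future itself never
// resolves within the allotted time.
async fn timed_attempt<T, E>(
	per_attempt_timeout: Duration,
	operation: impl std::future::Future<Output = Result<T, E>>,
) -> Result<T, AttemptError<E>> {
	match timeout(per_attempt_timeout, operation).await {
		Ok(Ok(value)) => Ok(value),
		Ok(Err(e)) => Err(AttemptError::Op(e)),
		Err(_elapsed) => Err(AttemptError::Timeout),
	}
}
```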


@jkczyz jkczyz left a comment


Waiting on @tnull (no rush!). Just want to silence the review bot.

@tnull tnull force-pushed the 2025-08-enable-client-side-delays branch from 5e83c2b to 5670866 on November 4, 2025 13:22
@tnull tnull changed the title Enable client-side timeouts Enable client-side timeouts and replace retry logic with reqwest's Nov 4, 2025
@tnull tnull requested review from jkczyz and tankyleo November 4, 2025 13:23
@tnull tnull force-pushed the 2025-08-enable-client-side-delays branch from 5670866 to 4e360cf on November 4, 2025 13:24
@tnull
Contributor Author

tnull commented Nov 4, 2025

Excuse the delay here. After yet another tokio/vss-client/reqwest odyssey, I now discovered that switching to this branch (or more likely: reqwest >v0.12) got rid of some blocking behavior for ChannelMonitor persistence to VSS. When looking into it, I also realized that reqwest recently shipped retry logic, which (mostly) allows us to remove our home-brewed (and ~broken, as discussed above) retry logic. I therefore now updated this PR to drop crate::utils::retry entirely and switch to the reqwest-provided logic, making this PR's diff net negative 🎉. The only piece missing is really just to classify whether or not to retry by error type, which we hope they will add at some point (seanmonstar/reqwest#2852), but we should probably be good to ship this as is in the meantime. It definitely should be a net improvement.

@tnull tnull force-pushed the 2025-08-enable-client-side-delays branch 2 times, most recently from 9c94320 to 9c3148f on November 5, 2025 12:24
@tnull tnull mentioned this pull request Nov 5, 2025
@tnull tnull force-pushed the 2025-08-enable-client-side-delays branch 4 times, most recently from f926914 to 8b8910d on November 5, 2025 13:16
@tnull
Contributor Author

tnull commented Nov 5, 2025

Now also tacked on a commit that corrects the expected HTTP status codes in our mocked test service. Relatedly, I discovered that the Rust vss-server didn't use the correct codes either (a regression from the Java version); fixed in lightningdevkit/vss-server#65.

@tnull
Contributor Author

tnull commented Nov 5, 2025

Now also tacked on a commit that corrects the expected HTTP status codes in our mocked test service. Relatedly, I discovered that the Rust vss-server didn't use the correct codes either (a regression from the Java version); fixed in lightningdevkit/vss-server#65.

.. and finally, it turns out that we can drop our RetryPolicy in this PR: once corrected, we can deal with all the cases through HTTP status codes, with no need to parse the payload.

@tnull tnull force-pushed the 2025-08-enable-client-side-delays branch from 6bc7bce to b258cec on November 5, 2025 14:04
tnull added 6 commits November 5, 2025 15:22
While the `RetryPolicy` has a `MaxTotalDelayRetryPolicy`, the retry
`loop` only checks this configured delay once the operation future
actually returns a value. However, without client-side timeouts, the
operation isn't actually guaranteed to ever return anything (even an
error, IIUC).

So here, we enable some coarse client-side default timeouts to ensure
the polled futures eventually return either the response *or* an error
we can handle via our retry logic.

.. as some types are part of our API.

We here make use of `reqwest`'s `retry` functionality that was
introduced with its recent v0.12.23 release. As we can't fully replace
the old 'application-level' behavior just yet, we configure it to only
apply to errors that aren't INTERNAL_SERVER_ERROR, which is the status
VSS uses to send error responses.

We set some default values here, but note that these can always be
overridden by the user when using the `from_client` constructor.

We drop our (~broken) `RetryPolicy`, and replace it with the new-ish
retry policy that has shipped in `reqwest` since v0.12.23.
@tnull tnull force-pushed the 2025-08-enable-client-side-delays branch from b258cec to e6d8e57 on November 5, 2025 14:24
@tnull
Contributor Author

tnull commented Nov 5, 2025

Rebased to resolve minor conflicts.

Previously, we'd allow either re-using a `reqwest::Client` or supplying a
header provider. Here we add a new constructor that allows us to do both
at the same time.
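
A hypothetical shape for such a constructor (all names here are assumed stand-ins, not the crate's actual API):

```rust
use std::sync::Arc;

// Hypothetical stand-in for the crate's header-provider trait.
trait HeaderProvider: Send + Sync {}

// Hypothetical client wrapper.
struct VssHttpClient {
	client: reqwest::Client,
	header_provider: Arc<dyn HeaderProvider>,
}

impl VssHttpClient {
	// Accepts a pre-configured `reqwest::Client` *and* a header provider,
	// where previously only one or the other could be supplied.
	fn from_client_with_headers(
		client: reqwest::Client,
		header_provider: Arc<dyn HeaderProvider>,
	) -> Self {
		Self { client, header_provider }
	}
}
```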
@ldk-reviews-bot

🔔 1st Reminder

Hey @jkczyz @tankyleo! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

