-
Notifications
You must be signed in to change notification settings - Fork 1.1k
fix(kad): enforce a timeout for inbound substreams #6009
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
925b8ac
to
36ce9b5
Compare
This comment was marked as resolved.
This comment was marked as resolved.
Thank you for the PR @teor2345! For outbound substreams, the timeout is implemented by using the That would also match the implementation in other protocols, as you already stated in #5981:
|
Unfortunately not (or not without a larger refactor). rust-libp2p/protocols/kad/src/handler.rs Lines 906 to 907 in 5377558
It would be possible to hold the streams in a But we can't iterate through a rust-libp2p/protocols/kad/src/handler.rs Lines 557 to 582 in 5377558
So the only solution I could find is to add a timeout field to some of the inbound substream states. ( The underlying issue is that the code couples the state of the substream with the stream of items from it. A refactor could put Is this a change that would be acceptable in a bug fix? Particularly one that other users might want backported? |
Could we not use EDIT: Or maybe poll the stream in the future for the set? |
Sure, but that doesn't deal with substream reuse, because the There are two pieces of functionality that this code needs:
Here's one possible way to implement that:
This will involve some quite weird types, like Is this a change that would be acceptable in a bug fix? Particularly one that other users might want backported? I also can't guarantee I'll have time to work on this any time soon, because the current PR code works, and fixes our downstream bug. |
Thank you for the follow-ups and detailed explanation @teor2345.
Opened thomaseizinger/rust-futures-bounded#8 to see if we can implement iterator for |
#6015 made the timeout for outbound substreams in kademlia configurable by introducing a `outbound_substreams_timeout` config option. There is currently an open PR #6009 that plans to also apply that timeout to inbound substream, in which case we probably want toe rename the config option to `substreams_timeout` (without the prefix). Because we plan to release soon and might not get #6009 in in-time, I propose that we already do the renaming now, to avoid breaking the API again with the next release. Pull-Request: #6076.
friendly ping @teor2345, I saw thomaseizinger/rust-futures-bounded#10 (comment) but just to confirm, do you plan on continuing this? |
I think that thomaseizinger/rust-futures-bounded#10 will be a better alternative once it's released. |
This pull request has merge conflicts. Could you please resolve them @teor2345? 🙏 |
5377558
to
5fd5603
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jxs @elenaf9 this is now ready for review again!
Most of the changes are for the futures-bounded
0.3.0 upgrade, happy to split them out into a separate PR if that would help.
I've marked it as draft because the kad
/StreamSet
changes depend on an unreleased version of futures-bounded
(they are waiting for us to test these changes).
I can test this branch with subspace, but are there any other tests you'd like to do?
This pull request has merge conflicts. Could you please resolve them @teor2345? 🙏
It really doesn't 🤣
) | ||
}) | ||
{ | ||
*s.get_mut() = new_substream; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because StreamSet
only allows try_push()
, we can't go over the stream limit, even temporarily. So the new stream has to replace the old one immediately. This also means we don't need the Canceled
state.
This code is more robust, because old code could have subtle bugs. For example, if we added two streams over the limit in a row (so we had 33 streams), the equality check (== 32
) would fail, and allow us to add even more streams over the 32 limit.
if self.inbound_substreams.len() >= MAX_NUM_STREAMS { | ||
if let Some(s) = self | ||
.inbound_substreams | ||
.iter_mut_of_type::<InboundSubstreamState>() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This weird construct is required because the underlying API (in SelectAll
) doesn't allow typed iteration.
if let Poll::Ready(Some(event)) = self.inbound_substreams.poll_next_unpin(cx) { | ||
return Poll::Ready(event); | ||
if let Poll::Ready(Some(event_result)) = self.inbound_substreams.poll_next_unpin(cx) { | ||
match event_result { | ||
Ok(event) => return Poll::Ready(event), | ||
Err(_stream_set_timeout) => { | ||
tracing::trace!( | ||
"Inbound substream timed out waiting for peer, send, or close" | ||
); | ||
continue; | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure what to do with an inbound substream timeout.
We could:
- explicitly close the substream
- return some synthetic inbound timeout event
- something else???
Description
In the
kad
substream handler, outbound substreams have a 10s timeout, but inbound substreams don't have any timeout.This results in large numbers of warnings under specific heavy load conditions, which we have encountered in subspace:
After this fix, if a substream times out, it is dropped from the set. The existing reusable substream behaviour is preserved: idle substreams are replaced when the substream limit is reached.
Fixes #5981
Cleanups
The substreams are pinned during iteration, so we can't remove them from the set. But we can replace them with a new substream. Since reusable streams are replaced directly, the
Canceled
state is no longer needed.This PR also includes an upgrade to
futures-bounded
0.3.0, in a separate commit.Notes & open questions
Should the substream be closed on a timeout?
The existing code doesn't close them on (most) substream errors, so this PR handles timeouts the same way, by dropping the substream without closing it.
I have been unable to replicate the specific conditions leading to this bug in a test or my local subspace node, but we've confirmed the fix works on multiple nodes in the subspace network.
Change checklist