Backports for 0.6.2 #613
Conversation
Previously, we had to configure enormous syncing timeouts as the BDK wallet syncing would hold a central mutex that could lead to large parts of event handling and syncing locking up. Here, we drop the configured timeouts considerably across the board, since such huge values are hopefully not required anymore.
Previously, we used a channel to indicate that the background processor task had been stopped. Here, we instead just await the task's `JoinHandle`, which is more robust in that it avoids a race condition.
.. we provide finer-grained logging after each step of `stop`.
👋 Thanks for assigning @TheBlueMatt as a reviewer!
Now also includes #590 to fix the CLN CI on this branch, too.
```rust
loop {
	let timeout_fut = tokio::time::timeout(
		Duration::from_secs(BACKGROUND_TASK_SHUTDOWN_TIMEOUT_SECS),
		tasks.join_next_with_id(),
```
Rather than letting each of these just take their time, IMO we should have a separate `JoinSet` for the things that we can cancel right away vs. the ones we have to join. That way we can just call `abort_all` for the tasks that we really don't want to wait around on for no reason.
Going through the callers we have:
- `continuously_sync_wallets` - not clear that this is cancel-safe? I assume not, since it's all BDK stuff and they don't guarantee that?
- RGS sync - clearly can just be cancelled. There's no need to wait for a timeout.
- inbound TCP acceptor - clearly can just be cancelled, though it should reliably exit fast.
- reconnect loop - clearly can just be cancelled, and it might be important because the connect logic loops until the peer finishes its handshake.
- node announce - clearly we can just cancel.
- liquidity handler - probably not cancel-safe.
We do have a stop signal for all of these, which should be very similar to aborting for most of the cited cases, as ~all of them should be selecting ~instantly on the signal receiver.

Note that most of them also perform IO, even if it's just to update `NodeMetrics` after they're done (e.g., RGS, node announcement). So, for one, some of these tasks might simply not be cancellable (as you can't cancel IOPS/`spawn_blocking`), but I also think it's cleaner/easier to reason about if we simply let them reach the next await point.
If you find it much preferable I can add a separate set for the TCP tasks, but not entirely convinced it's worth it?
> We do have a stop signal for all of these, which should be very similar to aborting for most of the cited cases, as ~all of them should be selecting ~instantly on the signal receiver.
This is at least not true for the background connection task. It has a sleep in the loop but it isn't selected with the exit task. If it happens to be connecting that guarantees we'll wait like a few RTTs to exit, but if the peer is hung it could be a lot longer. Rather than trying to pipe the exit signal through everywhere it seems much simpler to just cancel the tasks.
Okay, now split the TCP tasks out.
Force-pushed from edd22b2 to dbe11e8.
src/lib.rs (outdated)
```rust
@@ -661,6 +684,53 @@ impl Node {
	self.chain_source.stop();
	log_debug!(self.logger, "Stopped chain sources.");

	// Cancel cancellable background tasks
	if let Some(mut tasks) = self.cancellable_background_tasks.lock().unwrap().take() {
		tasks.abort_all();
```
We still want to `join` them even after we `abort`, I think.
Why would we?
Because they may have been in sync code at the time we called `abort_all`, and we want to wait until they reach an await point so that we know they're actually not running. It's not all that critical, but in theory the connection logic is still race-y, I think.
Okay, now doing so.
Force-pushed from dbe11e8 to a0be1a5.
src/lib.rs (outdated)
```rust
	log_debug!(self.logger, "Stopped chain sources.");

	// Cancel cancellable background tasks
	if let Some(mut tasks) = self.cancellable_background_tasks.lock().unwrap().take() {
```
Also, doesn't this need to happen after `disconnect_all_peers`, which is a few lines up?
You mean before? For the TCP tasks yes. Moved up.
Oops, yes, thanks.
Previously, we'd only wait for the background processor task to successfully finish. It turned out that this could lead to races when the other background tasks took too long to shut down. Here, we attempt to wait on all background tasks shutting down for a bit, before moving on.
.. as tokio tends to panic when dropping a runtime in an async context if we're not super careful. Here, we add some test coverage for this edge case in Rust tests.
Force-pushed from a0be1a5 to 1a2d900.
These are backports of #592 and #612 for the upcoming 0.6.2 release.