Persist ChannelMonitors after new blocks are connected #1108
Conversation
/// Failures here do not imply the channel will be force-closed, however any future calls to
/// [`update_persisted_channel`] after an error is returned here MUST either persist the full,
/// updated [`ChannelMonitor`] provided to [`update_persisted_channel`] or return
/// [`ChannelMonitorUpdateErr::PermanentFailure`], force-closing the channel. In other words,
/// any future calls to [`update_persisted_channel`] after an error here MUST NOT persist the
/// [`ChannelMonitorUpdate`] alone.
Could we simplify this by making update_persisted_channel take an Option<&ChannelMonitorUpdate> and forgoing adding a sync_persisted_channel method to the trait? Then when the user sees None they must write the full ChannelMonitor.
The return value distinction is the harder part to capture - users don't really have the option to return a PermanentFailure here as the call site (the ChainMonitor) doesn't have the ability to force-close a channel.
Ok, I think I understand the subtlety here. But shouldn't we not call update_persisted_channel once sync_persisted_channel returns an error rather than rely on the user to implement update_persisted_channel correctly? Or at very least never pass them a ChannelMonitorUpdate there (making it an Option) even if we still need the extra method.
Yea, that's a good point, I guess I hadn't thought about making the update an Option and skipping it if we've marked a channel as "previous sync didn't persist". The "don't call update_persisted_channel" change, though, feels wrong - technically, if the user complies with the API docs, we can continue operating fine after a restart, but if we don't call update_persisted_channel at all we have to force-close the channel immediately.
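To illustrate the Option<&ChannelMonitorUpdate> idea discussed in this thread, here is a minimal sketch with stand-in types and hypothetical helper functions - not the real lightning API:

// Sketch only: MonitorBytes/UpdateBytes stand in for the real serialized types.
struct MonitorBytes(Vec<u8>);
struct UpdateBytes(Vec<u8>);

// `None` would signal "no delta is available (e.g. the previous chain-sync
// persist failed), so the full ChannelMonitor must be written"; `Some` lets a
// persister store just the incremental update if it wants to.
fn update_persisted_channel(update: Option<&UpdateBytes>, full_monitor: &MonitorBytes) -> Result<(), ()> {
    match update {
        Some(delta) => append_delta(delta),        // hypothetical helper
        None => write_full_monitor(full_monitor),  // hypothetical helper
    }
}

fn append_delta(_delta: &UpdateBytes) -> Result<(), ()> { Ok(()) }
fn write_full_monitor(_monitor: &MonitorBytes) -> Result<(), ()> { Ok(()) }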
What if after a block comes in, we set a bool in ChannelMonitor indicating there’s been a new block? Then we can change update_persisted_channel to pass in whether a full sync is needed to avoid a big chain resync.
Can also check the new bool in timer_tick and call update_persisted_channel then, if we also make the ChannelMonitorUpdate an Option maybe
We still need to sync the monitor before we can hand any chain-generated events back to the ChannelManager, so I don't think we can avoid a sync at all.
Codecov Report
@@ Coverage Diff @@
## main #1108 +/- ##
==========================================
- Coverage 90.58% 90.40% -0.19%
==========================================
Files 66 67 +1
Lines 34459 34658 +199
==========================================
+ Hits 31215 31332 +117
- Misses 3244 3326 +82
Continue to review full report at Codecov.
29c7f1e to
60a8a70
Compare
Also shoved in two commits to move
322bc11 to
bde998c
Compare
lightning/src/chain/chainmonitor.rs
Outdated
let mut ev_lock = self.event_mutex.lock().unwrap();
txn_outputs = process(monitor, txdata);
log_trace!(self.logger, "Syncing Channel Monitor for channel {}", log_funding_info!(monitor));
if let Err(()) = self.persister.sync_persisted_channel(*funding_outpoint, monitor) {
Do you have plans to aggregate updates across multiple blocks (in case of a restart)? Or maybe that is also problematic if offline for a while and wouldn't persist until the end of the sync. 😕
I think we're "just fine" here, somewhat due to an API quirk. The on-load replay is supposed to be per-ChannelMonitor, and not via the chain::Watch/ChainMonitor, or at least we recommend it. Technically I think a user could use ChainMonitor during that time, but at least we don't?
Re: the first commit, would it be possible to separate out the fix for the redundant chain sync from the fix for the duplicate PaymentSent scenario? Seems that might ease review.
bde998c to
96d9895
Compare
Making this a draft for now, I think it'll end up depending on a new test-refactor/ChainMonitor-api-refactor PR.
96d9895 to
66cc962
Compare
This is now based on #1112 and should have largely addressed the feedback.
374e622 to
d3f0772
Compare
d3f0772 to
33c37f0
Compare
895d4ce to
f73999d
Compare
Tested as a part of #1130.
4725103 to
f0aa603
Compare
monitor_state.last_chain_persist_height.load(Ordering::Acquire) + LATENCY_GRACE_PERIOD_BLOCKS as usize
    > self.highest_chain_height.load(Ordering::Acquire)
Can there be an edge case around reorgs here where highest_chain_height is reduced while blocks are disconnected, release_pending_monitor_events is called, and then the new blocks are connected?
Hmm, yea, was thinking it wasn't worth worrying about, but it's easy enough to just fetch_max, so I did that.
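For reference, a std-only sketch of the fetch_max semantics mentioned here; the code quoted later in the thread achieves the same monotonic behaviour with a guarded compare_exchange loop:

use std::sync::atomic::{AtomicUsize, Ordering};

// Only ever raises the recorded height: a lower value (as seen mid-reorg, when
// a shorter chain tip is connected) never overwrites the stored maximum.
fn update_highest_chain_height(highest_chain_height: &AtomicUsize, height: u32) {
    highest_chain_height.fetch_max(height as usize, Ordering::AcqRel);
}

fn main() {
    let highest = AtomicUsize::new(100);
    // A 2-block reorg: disconnect down to 98, then connect a block at 99.
    update_highest_chain_height(&highest, 99);
    assert_eq!(highest.load(Ordering::Acquire), 100); // 99 did not win
}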
Thanks for adding the testing! Almost ready to sign off
entry.insert(MonitorHolder {
    monitor,
    pending_monitor_updates: Mutex::new(pending_monitor_updates),
    channel_perm_failed: AtomicBool::new(false),
Could we be a bit smarter and check monitor.locked_from_offchain? Or, is there any way to inquire the monitor rather than defaulting to false?
I don't think we care? If we get locked_from_offchain that means ChannelManager told us the channel is closed, so we're probably not gonna get into watch_channel at that point. The only case where it can be perm-failed at this point is if we failed to persist above, but then we'd have returned early already.
We also take this opportunity to drop byte_utils::le64_to_array, as our MSRV now supports the native to_le_bytes() call.
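For reference, the std replacement being referred to (an illustrative snippet, not the actual diff):

fn main() {
    // Previously a helper like byte_utils::le64_to_array(x) was used; the
    // native method is available on all integer types on the new MSRV.
    let x: u64 = 0x0102_0304_0506_0708;
    let bytes: [u8; 8] = x.to_le_bytes();
    assert_eq!(bytes, [0x08, 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01]);
}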
Squashed all fixups except for new ones from the latest round of feedback.
3eb7879 to
2a22a16
Compare
lightning/src/chain/chainmonitor.rs
Outdated
let mut old_height = self.highest_chain_height.load(Ordering::Relaxed);
while self.highest_chain_height
    .compare_exchange(old_height, height as usize, Ordering::AcqRel, Ordering::Relaxed).is_err()
{
    old_height = self.highest_chain_height.load(Ordering::Acquire);
}
So taking a closer look, we actually never call process_chain_data when blocks are disconnected, so highest_chain_height won't be reduced. I suppose a user could call best_block_updated with a smaller height, though, and there's some logic in ChannelMonitor to handle this case.
But when given a smaller height, this wouldn't be equivalent to fetch_max, would it?
I believe it's still an issue in 2-block reorgs - you may disconnect two blocks and then connect one and see a height that is one lower than the target.
And, oops, yes, added the requisite compare :)
8f47cf8 to
c1c5462
Compare
This should be pretty much good to go, a couple comments outstanding.
lightning/src/chain/chainmonitor.rs
Outdated
/// it is up to you to maintain a correct mapping between the outpoint and the
/// stored channel data). Note that you **must** persist every new monitor to
/// disk. See the `Persist` trait documentation for more details.
/// Persist a new channel's data. The data can be stored any way you want, but the identifier
Just wanted to circle back on this. We could possibly rename persist_new_channel to persist_channel and pass it an enum stating the type of persistence: new channel, re-persistence, etc. Then get rid of the Option in the update_persisted_channel parameter. But that may involve piping the enum through chain::Watch? I don't feel strongly about it, but the docs may need to be clarified here.
Right, my concern is also that the chain::Watch should (or should be able to) reject new ChannelMonitors for the same outpoint twice. I believe we actually rely on this behavior somewhat to avoid some attacks. If we have an enum that controls whether the user is supposed to return an error or not, I'm very worried they just won't do it.
Good point. Happy to leave it as is.
Which docs do you think need to be updated?
Ah, just that this is not necessarily a new channel since it may be called by watch_channel upon restart as per ChannelManagerReadArgs' docs.
lightning/src/chain/chainmonitor.rs
Outdated
let mut old_height = self.highest_chain_height.load(Ordering::Relaxed);
let new_height = height as usize;
while new_height > old_height && self.highest_chain_height
    .compare_exchange(old_height, new_height, Ordering::AcqRel, Ordering::Relaxed).is_err()
{
    old_height = self.highest_chain_height.load(Ordering::Acquire);
}
I'd find this simpler/more readable:
let old_height = ..
let new_height = ..
if new_height > old_height {
self.highest_chain_height.store(new_height)
}
maybe I'm missing something here?
Duh, yes, can do that as long as it's under the lock. I've moved the code down and simplified.
ACK 240fc03
240fc03 to
a65686c
Compare
Squashed without changes, diff from Val's ACK (will land after CI):
In the next commit we'll need ChainMonitor to "see" when a monitor persistence completes, which means `monitor_updated` needs to move to `ChainMonitor`. The simplest way to then communicate that information to `ChannelManager` is via `MonitorEvent`s, which seems to line up ok, even if they're now constructed in multiple different places.
In the next commit, we'll be originating monitor updates both from the ChainMonitor and from the ChannelManager, making simple sequential update IDs impossible. Further, the existing async monitor update API was somewhat hard to work with - instead of being able to generate monitor_updated callbacks whenever a persistence process finishes, you had to ensure you only did so at least once all previous updates had also been persisted. Here we eat the complexity for the user by moving to an opaque type for monitor updates, tracking which updates are in-flight for the user and only generating monitor-persisted events once all pending updates have been committed.
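A rough std-only sketch of the bookkeeping this describes (field and type names are illustrative, not the actual ChainMonitor internals): each persistence call gets an opaque id, and a "monitor persisted" event is only surfaced once no ids remain in flight.

use std::collections::HashSet;
use std::sync::Mutex;

struct PendingUpdates {
    // Opaque ids of persistence operations the user has not yet completed.
    in_flight: Mutex<HashSet<u64>>,
}

impl PendingUpdates {
    fn begin(&self, update_id: u64) {
        self.in_flight.lock().unwrap().insert(update_id);
    }

    // Returns true only when every pending update has completed, i.e. the
    // point at which a monitor-persisted event can go to the ChannelManager.
    fn complete(&self, update_id: u64) -> bool {
        let mut in_flight = self.in_flight.lock().unwrap();
        in_flight.remove(&update_id);
        in_flight.is_empty()
    }
}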
a65686c to
a3d07e6
Compare
Oops, added missing doclinks and dropped a link in private docs that needed linking, will land after CI:
This resolves several user complaints (and issues in the sample node) where startup is substantially delayed as we're always waiting for the chain data to sync. Further, in an upcoming PR, we'll be reloading pending payments from ChannelMonitors on restart, at which point we'll need the change here which avoids handling events until after the user has confirmed the `ChannelMonitor` has been persisted to disk. It will avoid a race where we
 * send a payment/HTLC (persisting the monitor to disk with the HTLC pending),
 * force-close the channel, removing the channel entry from the ChannelManager entirely,
 * persist the ChannelManager,
 * connect a block which contains a fulfill of the HTLC, generating a claim event,
 * handle the claim event while the `ChannelMonitor` is being persisted,
 * persist the ChannelManager (before the ChannelMonitor is persisted fully),
 * restart, reloading the HTLC as a pending payment in the ChannelManager, which now has no references to it except from the ChannelMonitor which still has the pending HTLC,
 * replay the block connection, generating a duplicate PaymentSent event.
ChannelMonitors now require that they be re-persisted before MonitorEvents are provided to the ChannelManager, the exact thing that test_dup_htlc_onchain_fails_on_reload was testing for when it *didn't* happen. As such, test_dup_htlc_onchain_fails_on_reload is now testing that we behave correctly when the API guarantees are not met, something we don't need to do. Here, we adapt it to test the new API requirements through ChainMonitor's calls to the Persist trait instead.
If we have a `ChannelMonitor` update from an on-chain event which returns a `TemporaryFailure`, we block `MonitorEvent`s from that `ChannelMonitor` until the update is persisted. This prevents duplicate payment send events to the user after payments get reloaded from monitors on restart. However, if the event being avoided isn't going to generate a PaymentSent, but instead result in us claiming an HTLC from an upstream channel (ie the HTLC was forwarded), then the result of a user delaying the event is that we delay getting our money, not a duplicate event. Because user persistence may take an arbitrary amount of time, we need to bound the amount of time we can possibly wait to return events, which we do here by bounding it to 3 blocks. Thanks to Val for catching this in review.
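Concretely, the gate quoted earlier in the thread boils down to something like the following (a sketch mirroring that check, not the code verbatim):

const LATENCY_GRACE_PERIOD_BLOCKS: usize = 3;

// Hold back MonitorEvents only while a chain-sync persistence is still in
// flight AND we're within the grace period; past that, release them anyway.
fn should_hold_monitor_events(
    chain_sync_persist_pending: bool,
    last_chain_persist_height: usize,
    highest_chain_height: usize,
) -> bool {
    chain_sync_persist_pending
        && last_chain_persist_height + LATENCY_GRACE_PERIOD_BLOCKS > highest_chain_height
}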
It's somewhat confusing that `persist_new_channel` is called on startup for an existing channel in common deployments, so we call it out explicitly.
a3d07e6 to
6fb5bd3
Compare