Replay lost `MonitorEvent`s in some cases for closed channels #4004
Conversation
👋 Thanks for assigning @valentinewallace as a reviewer!
Force-pushed from 658c408 to dc20515
One of the new tests is failing.
Force-pushed from dc20515 to e076dd5
Oops, silent rebase conflict due to #4001.
Test is still failing.
Force-pushed from e076dd5 to 86a58b6
Grr, the #4001 changes were random so had to fix all the paths, should be fine now.
Codecov Report

@@            Coverage Diff             @@
##             main    #4004      +/-   ##
==========================================
+ Coverage   88.85%   88.86%    +0.01%
==========================================
  Files         175      175
  Lines      127710   128093      +383
==========================================
+ Hits       113478   113835      +357
- Misses      11675    11692       +17
- Partials     2557     2566        +9
for (channel_id, monitor) in args.channel_monitors.iter() {
	let mut is_channel_closed = false;
	let counterparty_node_id = monitor.get_counterparty_node_id();
	if let Some(peer_state_mtx) = per_peer_state.get(&counterparty_node_id) {
		let mut peer_state_lock = peer_state_mtx.lock().unwrap();
		let peer_state = &mut *peer_state_lock;
		is_channel_closed = !peer_state.channel_by_id.contains_key(channel_id);
	}
I'm not seeing the bug that this is fixing -- if a payment is failed after being fulfilled in `outbound_payments`, we'll terminate early after noticing it's fulfilled already in `OutboundPayments::fail_htlc`.

Context from the commit message: "... we could get both a `PaymentFailed` and a `PaymentClaimed` event on startup for the same payment"
Let's say we send an MPP payment along two different channels. One is claimed and the other failed (who knows why, the recipient just decided they don't like money for whatever reason). Going through the single loop we may first find the failed-HTLC channel - we'll add the pending payment in `insert_from_monitor_on_startup`, which will just add the one part, then we'll see it failed and generate a `PaymentFailed` event since there's only one part. Then we'll go to the second channel and repeat the same process, but now with a `PaymentSent` event.
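To make the sequencing concrete, here is a minimal self-contained sketch (toy types and names, not LDK code) of the failure mode described above: because each monitor's part is re-added and resolved within the same loop iteration, one MPP payment can emit both a failure-style and a success-style event.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy)]
enum PartOutcome {
    Failed,
    ClaimedWithPreimage,
}

fn main() {
    // Two parts of one MPP payment, found in two different channel monitors.
    let monitors = [("chan_a", PartOutcome::Failed), ("chan_b", PartOutcome::ClaimedWithPreimage)];

    let mut pending_parts: HashMap<&str, usize> = HashMap::new();
    let mut events = Vec::new();

    for (_chan, outcome) in monitors {
        // Step 1: re-add this part of the pending payment (one part only!).
        *pending_parts.entry("payment_x").or_insert(0) += 1;

        // Step 2: immediately resolve it based on this monitor alone.
        match outcome {
            PartOutcome::Failed => {
                *pending_parts.get_mut("payment_x").unwrap() -= 1;
                if pending_parts["payment_x"] == 0 {
                    // All (known) parts failed => looks like the whole payment failed.
                    events.push("PaymentFailed");
                    pending_parts.remove("payment_x");
                }
            },
            PartOutcome::ClaimedWithPreimage => {
                events.push("PaymentSent");
                pending_parts.remove("payment_x");
            },
        }
    }

    // Prints ["PaymentFailed", "PaymentSent"] -- two contradictory events for
    // the same payment, because each part was judged in isolation.
    println!("{events:?}");
}
```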
for (channel_id, monitor) in args.channel_monitors.iter() {
	let mut is_channel_closed = false;
	let counterparty_node_id = monitor.get_counterparty_node_id();
	if let Some(peer_state_mtx) = per_peer_state.get(&counterparty_node_id) {
		let mut peer_state_lock = peer_state_mtx.lock().unwrap();
		let peer_state = &mut *peer_state_lock;
		is_channel_closed = !peer_state.channel_by_id.contains_key(channel_id);
	}
Would appreciate some more context on this part of the commit message: "this can lead to a pending payment getting re-added and re-claimed multiple times" (phrased as bad).

It looks like in the prior code and this new code, we'll call `OutboundPayments::insert_from_monitor_on_startup` for each `session_priv` (seems fine), and then call `claim_htlc` for each `session_priv` (seems fine since this will only generate a `PaymentSent` event on the first call to `claim_htlc`). Can you help me understand what I'm missing that was buggy in the prior code?
I rewrote the commit message.
($holder_commitment: expr, $htlc_iter: expr) => {
	for (htlc, source) in $htlc_iter {
		let filter = |v: &&IrrevocablyResolvedHTLC| {
			v.commitment_tx_output_idx == htlc.transaction_output_index
The output index can be `None` on both sides currently, is that intentional?
Uhhh, it wasn't, but I believe it is almost the correct behavior - if we have a dust HTLC that never made it to chain and the commitment is confirmed, we want to consider it resolved. I'll tweak it to make it clearer and more likely correct.
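For illustration only, here is a small sketch of the stricter matching hinted at above, using hypothetical stand-in structs that mirror just the two fields quoted in the diff (this is not the actual `ChannelMonitor` code): a `None == None` match is treated as a dust HTLC implicitly resolved by the commitment transaction confirming, while a pair where only one side has an output index is rejected.

```rust
// Simplified stand-ins; field names mirror the diff above, nothing more.
struct IrrevocablyResolvedHTLC {
    commitment_tx_output_idx: Option<u32>,
}

struct HTLCOutputInCommitment {
    transaction_output_index: Option<u32>,
}

fn resolves(state: &IrrevocablyResolvedHTLC, htlc: &HTLCOutputInCommitment) -> bool {
    match (state.commitment_tx_output_idx, htlc.transaction_output_index) {
        // Both sides point at a concrete output: match on the index.
        (Some(state_idx), Some(htlc_idx)) => state_idx == htlc_idx,
        // Dust HTLC with no output of its own: resolved by the commitment confirming.
        (None, None) => true,
        // One side has an output and the other doesn't: not the same HTLC.
        _ => false,
    }
}

fn main() {
    let dust_state = IrrevocablyResolvedHTLC { commitment_tx_output_idx: None };
    let dust_htlc = HTLCOutputInCommitment { transaction_output_index: None };
    assert!(resolves(&dust_state, &dust_htlc));

    let non_dust_htlc = HTLCOutputInCommitment { transaction_output_index: Some(1) };
    assert!(!resolves(&dust_state, &non_dust_htlc));
    println!("dust vs. non-dust matching behaves as described");
}
```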
		};
		if let Some(state) = us.htlcs_resolved_on_chain.iter().filter(filter).next() {
			if let Some(source) = source {
				if state.payment_preimage.is_none() {
Tests pass when removing this check atm. Probably fine if the test is too annoying to write though.
I believe it's unobservable. There are several cases where we'd hit a theoretical `else` block, but the `ChannelManager` read logic actually replays preimages first, so we'd always actually replay the preimage first, then get here, spuriously include it, and not have anything to fail.
👋 The first review has been submitted! Do you think this PR is ready for a second reviewer?
Force-pushed from 86a58b6 to 87ae44d
✅ Added second reviewer: @joostjager
{
	walk_htlcs!(
		false,
		us.funding.counterparty_claimable_outpoints.get(&txid).unwrap().iter().filter_map(
Potential panic due to `unwrap()` on a `None` value. The code calls `us.funding.counterparty_claimable_outpoints.get(&txid).unwrap()` but there's no guarantee that the txid exists in the map. If the txid is not found, this will panic at runtime. Should use `if let Some(outpoints) = us.funding.counterparty_claimable_outpoints.get(&txid)` or a similar safe pattern instead of `unwrap()`.

Suggested change:
-us.funding.counterparty_claimable_outpoints.get(&txid).unwrap().iter().filter_map(
+us.funding.counterparty_claimable_outpoints.get(&txid).unwrap_or(&Vec::new()).iter().filter_map(
Spotted by Diamond
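If the fallback behavior is actually desired here (rather than the `unwrap()` being upheld by an invariant that the txid is always present), a borrowed empty slice avoids both the potential panic and the temporary `Vec` allocation in the suggestion above. A self-contained sketch of the pattern with hypothetical stand-in types, not the real map:

```rust
use std::collections::HashMap;

fn main() {
    // Stand-in for `counterparty_claimable_outpoints`: txid -> HTLC output data.
    let outpoints: HashMap<u64, Vec<u32>> = HashMap::new();
    let txid = 42u64;

    // `.unwrap_or(&Vec::new())` also avoids the panic, but allocates a fresh
    // Vec on the miss path; borrowing a static empty slice does not.
    let htlcs: &[u32] = outpoints.get(&txid).map(Vec::as_slice).unwrap_or(&[]);
    for idx in htlcs {
        println!("claimable output index {idx}");
    }
}
```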
"Failing HTLC with payment hash {} as it was resolved on-chain.", | ||
htlc.payment_hash | ||
); | ||
failed_htlcs.push(( |
Will we generate a `PaymentFailed` and/or `HTLCHandlingFailed` event for these HTLCs on every restart until the monitor is removed? Could be worth noting in a comment if so.
That's fixed in #3984 :)
LGTM after squash
}

let txid = confirmed_txid.unwrap();
if Some(txid) == us.funding.current_counterparty_commitment_txid
Doesn't this need to consider the current/prev txids for the `pending_funding` scopes as well if we have `alternative_funding_confirmed` set?
It does, when it was written that PR hadn't landed yet :)
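A rough sketch of the kind of check being discussed, assuming (without verifying against the actual `ChannelMonitor` fields) that each funding scope carries current/previous counterparty commitment txids and that pending splice scopes live in a `pending_funding` list:

```rust
// Hypothetical, simplified stand-ins; field names mirror the diff above but
// this is not the actual LDK structure.
#[derive(Clone, Copy, PartialEq, Eq)]
struct Txid(u64);

struct FundingScope {
    current_counterparty_commitment_txid: Option<Txid>,
    prev_counterparty_commitment_txid: Option<Txid>,
}

impl FundingScope {
    fn has_counterparty_commitment(&self, txid: Txid) -> bool {
        self.current_counterparty_commitment_txid == Some(txid)
            || self.prev_counterparty_commitment_txid == Some(txid)
    }
}

/// Returns true if `txid` is a (current or previous) counterparty commitment
/// for the confirmed funding scope *or* any pending (e.g. splice) scope.
fn is_counterparty_commitment(
    funding: &FundingScope, pending_funding: &[FundingScope], txid: Txid,
) -> bool {
    funding.has_counterparty_commitment(txid)
        || pending_funding.iter().any(|scope| scope.has_counterparty_commitment(txid))
}

fn main() {
    let funding = FundingScope {
        current_counterparty_commitment_txid: Some(Txid(1)),
        prev_counterparty_commitment_txid: None,
    };
    let pending = [FundingScope {
        current_counterparty_commitment_txid: Some(Txid(7)),
        prev_counterparty_commitment_txid: None,
    }];
    assert!(is_counterparty_commitment(&funding, &pending, Txid(7)));
    println!("pending-scope commitment txids are considered too");
}
```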
Force-pushed from 842a018 to 8ae6661
`test_dup_htlc_onchain_doesnt_fail_on_reload` made reference to `ChainMonitor` persisting `ChannelMonitor`s on each new block, which hasn't been the case in some time. Instead, we update the comment and code to make explicit that it doesn't impact the test.
During testing, we check that a `ChannelMonitor` will round-trip through serialization exactly. However, we recently added a fix to change a value in `PackageTemplate` on reload to fix some issues in the field in 0.1. This can cause the round-trip tests to fail as a field is modified during read. We fix it here by simply exempting the field from the equality test in the condition where it would be updated on read. We also make the `ChannelMonitor` `PartialEq` trait implementation non-public, as weird workarounds like this make clear that such a comparison is a brittle API at best.
On `ChannelManager` reload we rebuild the pending outbound payments list by looking for any missing payments in `ChannelMonitor`s. However, in the same loop over `ChannelMonitor`s, we also re-claim any pending payments which we see we have a payment preimage for. If we send an MPP payment across different channels, the result may be that we'll iterate the loop, and in each iteration add a pending payment with only one known path, then claim/fail it and remove the pending payment (at least for the claim case). This may result in spurious extra events, or even both a `PaymentFailed` and `PaymentSent` event on startup for the same payment.
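One way to structure the rebuild so that no payment is judged before all of its parts are visible is a two-phase pass: collect every part from every monitor first, then resolve claims/failures against the complete set. The following is a toy sketch of that shape only, with illustrative names; it does not assert that this PR implements it exactly this way.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum PartOutcome {
    Failed,
    Claimed,
}

fn main() {
    // (payment id, outcome recorded in that channel's monitor)
    let monitor_parts = [("payment_x", PartOutcome::Failed), ("payment_x", PartOutcome::Claimed)];

    // Phase 1: rebuild the full pending-payment picture across all monitors.
    let mut parts_by_payment: HashMap<&str, Vec<PartOutcome>> = HashMap::new();
    for (payment, outcome) in monitor_parts {
        parts_by_payment.entry(payment).or_default().push(outcome);
    }

    // Phase 2: resolve each payment exactly once, with every part visible.
    for (payment, parts) in &parts_by_payment {
        if parts.iter().any(|p| *p == PartOutcome::Claimed) {
            println!("{payment}: PaymentSent (a preimage was found for at least one part)");
        } else {
            println!("{payment}: PaymentFailed (no part was claimed)");
        }
    }
}
```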
`MonitorEvent`s aren't delivered to the `ChannelManager` in a durable fashion - if the `ChannelManager` fetches the pending `MonitorEvent`s, then the `ChannelMonitor` gets persisted (i.e. due to a block update), then the node crashes prior to persisting the `ChannelManager` again, the `MonitorEvent` and its effects on the `ChannelManager` will be lost. This isn't likely in a sync persist environment, but in an async one this could be an issue.

Note that this is only an issue for closed channels - `MonitorEvent`s only inform the `ChannelManager` that a channel is closed (which the `ChannelManager` will learn on startup or when it next tries to advance the channel state), that `ChannelMonitorUpdate` writes completed (which the `ChannelManager` will detect on startup), or that HTLCs resolved on-chain post closure. Of the three, only the last is problematic to lose prior to a reload.

When we restart and, during `ChannelManager` load, see a `ChannelMonitor` for a closed channel, we scan it for preimages that we passed to it and re-apply those to any pending or forwarded payments. However, we didn't scan it for preimages it learned from transactions on-chain. In cases where a `MonitorEvent` is lost, this can lead to a lost preimage. Here we fix it by simply tracking preimages we learned on-chain the same way we track preimages picked up during normal channel operation.
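A toy illustration of the tracking described in this commit message, with hypothetical types rather than the real `ChannelMonitor` API: preimages learned from on-chain resolutions are surfaced through the same startup replay path as preimages the `ChannelManager` handed to the monitor, so losing the `MonitorEvent` cannot lose the claim.

```rust
use std::collections::HashMap;

type PaymentHash = [u8; 32];
type PaymentPreimage = [u8; 32];

struct ToyMonitor {
    // Preimages the ChannelManager handed us during normal channel operation.
    provided_preimages: HashMap<PaymentHash, PaymentPreimage>,
    // Preimages we extracted from counterparty claim transactions seen on-chain.
    onchain_preimages: HashMap<PaymentHash, PaymentPreimage>,
}

impl ToyMonitor {
    /// Everything the startup replay should see: both sources, one iterator.
    fn preimages_to_replay(&self) -> impl Iterator<Item = (&PaymentHash, &PaymentPreimage)> + '_ {
        self.provided_preimages.iter().chain(self.onchain_preimages.iter())
    }
}

fn main() {
    let monitor = ToyMonitor {
        provided_preimages: HashMap::new(),
        onchain_preimages: HashMap::from([([1u8; 32], [2u8; 32])]),
    };
    // Startup replay loop: apply each preimage to pending/forwarded payments.
    for (hash, _preimage) in monitor.preimages_to_replay() {
        println!("replaying claim for payment hash starting {:02x?}", &hash[..4]);
    }
}
```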
`MonitorEvent`s aren't delivered to the `ChannelManager` in a durable fashion - if the `ChannelManager` fetches the pending `MonitorEvent`s, then the `ChannelMonitor` gets persisted (i.e. due to a block update), then the node crashes prior to persisting the `ChannelManager` again, the `MonitorEvent` and its effects on the `ChannelManager` will be lost. This isn't likely in a sync persist environment, but in an async one this could be an issue. Note that this is only an issue for closed channels - `MonitorEvent`s only inform the `ChannelManager` that a channel is closed (which the `ChannelManager` will learn on startup or when it next tries to advance the channel state), that `ChannelMonitorUpdate` writes completed (which the `ChannelManager` will detect on startup), or that HTLCs resolved on-chain post closure. Of the three, only the last is problematic to lose prior to a reload.

In a previous commit we handled the case of claimed HTLCs by replaying payment preimages on startup to avoid `MonitorEvent` loss causing us to miss an HTLC claim. Here we handle the HTLC-failed case similarly. Unlike with HTLC claims via preimage, we don't already have replay logic in `ChannelManager` startup, but it's easy enough to add one. Luckily, we already track when an HTLC reaches permanently-failed state in `ChannelMonitor` (i.e. it has `ANTI_REORG_DELAY` confirmations on-chain on the failing transaction), so all we need to do is add the ability to query for that and fail them on `ChannelManager` startup.
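A toy sketch of the query-and-fail shape described in this commit message (hypothetical API and types, not the one this PR adds): the monitor reports HTLCs whose failing transaction is at least `ANTI_REORG_DELAY` confirmations deep and for which no preimage is known, and startup simply fails those back.

```rust
// Assumption: the constant name mirrors LDK's ANTI_REORG_DELAY; the value and
// all types here are illustrative only.
const ANTI_REORG_DELAY: u32 = 6;

struct ResolvedHtlc {
    payment_hash: [u8; 32],
    preimage_known: bool,
    confirmations: u32,
}

struct ToyMonitor {
    resolved: Vec<ResolvedHtlc>,
}

impl ToyMonitor {
    /// HTLCs that failed on-chain and are buried deeply enough to be final.
    fn permanently_failed_htlcs(&self) -> impl Iterator<Item = &ResolvedHtlc> + '_ {
        self.resolved
            .iter()
            .filter(|h| !h.preimage_known && h.confirmations >= ANTI_REORG_DELAY)
    }
}

fn main() {
    let monitor = ToyMonitor {
        resolved: vec![
            ResolvedHtlc { payment_hash: [1; 32], preimage_known: false, confirmations: 7 },
            ResolvedHtlc { payment_hash: [2; 32], preimage_known: true, confirmations: 7 },
        ],
    };
    // Startup replay: fail back whatever the monitor reports as permanently failed.
    for htlc in monitor.permanently_failed_htlcs() {
        println!("failing HTLC with payment hash starting {:02x?} on startup", &htlc.payment_hash[..4]);
    }
}
```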
Force-pushed from 8ae6661 to 69b281b
Rebased to update against the new funding logic in ChannelMonitor.
This is the first few commits from #3984, which handles some of the trivial cases but skips actually marking `MonitorEvent`s completed, which is needed for some further edge-cases.