Replay lost MonitorEvents in some cases for closed channels #4004


Open
TheBlueMatt wants to merge 13 commits into main from 2025-07-mon-event-failures-1

Conversation

TheBlueMatt
Collaborator

This is the first few commits from #3984 which handles some of the trivial cases, but skips actually marking MonitorEvents completed which is needed for some further edge-cases.

@TheBlueMatt added the backport 0.1 and weekly goal (Someone wants to land this this week) labels on Aug 11, 2025
@ldk-reviews-bot

ldk-reviews-bot commented Aug 11, 2025

👋 Thanks for assigning @valentinewallace as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@TheBlueMatt force-pushed the 2025-07-mon-event-failures-1 branch from 658c408 to dc20515 on August 11, 2025 at 20:18
@valentinewallace
Contributor

One of the new tests is failing

@TheBlueMatt force-pushed the 2025-07-mon-event-failures-1 branch from dc20515 to e076dd5 on August 12, 2025 at 01:46
@TheBlueMatt
Collaborator Author

Oops, silent rebase conflict due to #4001.

@valentinewallace
Contributor

Test is still failing

@TheBlueMatt force-pushed the 2025-07-mon-event-failures-1 branch from e076dd5 to 86a58b6 on August 12, 2025 at 14:46
@TheBlueMatt
Collaborator Author

Grr, the #4001 changes were random, so I had to fix all the paths; should be fine now.


codecov bot commented Aug 12, 2025

Codecov Report

❌ Patch coverage is 93.43434% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.86%. Comparing base (ac8f897) to head (69b281b).

Files with missing lines                 Patch %   Lines
lightning/src/ln/monitor_tests.rs        95.14%    6 Missing and 6 partials ⚠️
lightning/src/chain/package.rs           70.37%    8 Missing ⚠️
lightning/src/ln/channelmanager.rs       88.23%    3 Missing and 1 partial ⚠️
lightning/src/chain/channelmonitor.rs    97.70%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4004      +/-   ##
==========================================
+ Coverage   88.85%   88.86%   +0.01%     
==========================================
  Files         175      175              
  Lines      127710   128093     +383     
  Branches   127710   128093     +383     
==========================================
+ Hits       113478   113835     +357     
- Misses      11675    11692      +17     
- Partials     2557     2566       +9     
Flag Coverage Δ
fuzzing 21.82% <0.00%> (-0.05%) ⬇️
tests 88.70% <93.43%> (+0.01%) ⬆️


Comment on lines +16536 to +16444
for (channel_id, monitor) in args.channel_monitors.iter() {
let mut is_channel_closed = false;
let counterparty_node_id = monitor.get_counterparty_node_id();
if let Some(peer_state_mtx) = per_peer_state.get(&counterparty_node_id) {
let mut peer_state_lock = peer_state_mtx.lock().unwrap();
let peer_state = &mut *peer_state_lock;
is_channel_closed = !peer_state.channel_by_id.contains_key(channel_id);
}
Contributor

I'm not seeing the bug that this is fixing -- if a payment is failed after being fulfilled in outbound_payments, we'll terminate early after noticing it's fulfilled already in OutboundPayments::fail_htlc

Context from the commit message: "... we could get both a PaymentFailed and a PaymentClaimed event on startup for the same payment"

Collaborator Author

Let's say we send an MPP payment along two different channels. One is claimed and the other failed (who knows why, the recipient just decided they don't like money for whatever reason).

Going through the single loop we may first find the failed-htlc channel - we'll add the pending payment in insert_from_monitor_on_startup which will just add the one part, then we'll see it failed and generate a PaymentFailed event since there's only one part. Then we'll go to the second channel and repeat the same process, but now with a PaymentSent event.
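
To make the failure mode concrete, here is a minimal self-contained sketch (hypothetical names and types, not LDK's code or this PR's) of how resolving each monitor's part in the same loop that rebuilds the payment can emit both terminal events for one MPP payment:

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Event {
    PaymentFailed,
    PaymentSent,
}

fn main() {
    // Two monitors, each knowing one part of the same MPP payment: the part on
    // chan_a failed on-chain, the part on chan_b was claimed with the preimage.
    let monitor_outcomes = [("chan_a", false), ("chan_b", true)];

    let mut pending: HashSet<&str> = HashSet::new();
    let mut events = Vec::new();

    for (_chan, part_claimed) in monitor_outcomes {
        // Analogous to insert_from_monitor_on_startup: only this monitor's
        // part is known, so the payment looks like a single-part payment.
        pending.insert("payment_id");

        // Resolving the only known part "completes" the payment, so a
        // terminal event fires and the pending entry is dropped before the
        // next monitor re-adds it.
        events.push(if part_claimed { Event::PaymentSent } else { Event::PaymentFailed });
        pending.remove("payment_id");
    }

    // The same payment ends up with both a failure and a success event.
    assert_eq!(events, vec![Event::PaymentFailed, Event::PaymentSent]);
}
```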

Comment on lines +16536 to +16444
for (channel_id, monitor) in args.channel_monitors.iter() {
let mut is_channel_closed = false;
let counterparty_node_id = monitor.get_counterparty_node_id();
if let Some(peer_state_mtx) = per_peer_state.get(&counterparty_node_id) {
let mut peer_state_lock = peer_state_mtx.lock().unwrap();
let peer_state = &mut *peer_state_lock;
is_channel_closed = !peer_state.channel_by_id.contains_key(channel_id);
}
Contributor

Would appreciate some more context on this part of the commit message: "this can lead to a pending payment getting re-added and re-claimed multiple times" (which is phrased as a bad thing).

It looks like in the prior code and this new code, we'll call OutboundPayments::insert_from_monitor_on_startup for each session_priv (seems fine), and then call claim_htlc for each session_priv (seems fine since this will only generate a PaymentSent event on the first call to claim_htlc). Can you help me understand what I'm missing that was buggy in the prior code?

Collaborator Author

I rewrote the commit message.

($holder_commitment: expr, $htlc_iter: expr) => {
for (htlc, source) in $htlc_iter {
let filter = |v: &&IrrevocablyResolvedHTLC| {
v.commitment_tx_output_idx == htlc.transaction_output_index
Contributor

The output index can be None on both sides currently, is that intentional?

Collaborator Author

Uhhh, it wasn't, but I believe it is almost the correct behavior - if we have a dust HTLC that never made it to chain and the commitment is confirmed, we want to consider it resolved. I'll tweak it to make it clearer and more likely correct.
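
For illustration, a standalone sketch (simplified, hypothetical types; the `None` entry on the resolved side is an assumption made only for this example) of how a plain equality check on optional output indices ends up treating a dust HTLC, which has no on-chain output, as resolved:

```rust
// Simplified stand-ins for the real types.
#[derive(Clone, Copy)]
struct ResolvedHtlc {
    commitment_tx_output_idx: Option<u32>,
}

#[derive(Clone, Copy)]
struct Htlc {
    transaction_output_index: Option<u32>,
}

fn is_resolved(resolved: &[ResolvedHtlc], htlc: &Htlc) -> bool {
    resolved
        .iter()
        .any(|r| r.commitment_tx_output_idx == htlc.transaction_output_index)
}

fn main() {
    let resolved = [
        // A normal HTLC resolved at commitment output 1.
        ResolvedHtlc { commitment_tx_output_idx: Some(1) },
        // Hypothetical entry recorded when the confirmed commitment resolved
        // HTLCs that never had their own output.
        ResolvedHtlc { commitment_tx_output_idx: None },
    ];

    // Non-dust HTLC: matched by its output index.
    assert!(is_resolved(&resolved, &Htlc { transaction_output_index: Some(1) }));

    // Dust HTLC with no output: `None == None` also matches, i.e. it is
    // treated as resolved once the commitment confirms, which is roughly the
    // "almost correct" behavior described above.
    assert!(is_resolved(&resolved, &Htlc { transaction_output_index: None }));
}
```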

};
if let Some(state) = us.htlcs_resolved_on_chain.iter().filter(filter).next() {
if let Some(source) = source {
if state.payment_preimage.is_none() {
Contributor

Tests pass when removing this check atm. Probably fine if the test is too annoying to write though.

Collaborator Author

I believe it's unobservable. There are several cases where we'd hit a theoretical else block, but the ChannelManager read logic replays preimages first, so we'd always replay the preimage, then get here, spuriously include it, and not have anything to fail.

@ldk-reviews-bot

👋 The first review has been submitted!

Do you think this PR is ready for a second reviewer? If so, click here to assign a second reviewer.

@TheBlueMatt force-pushed the 2025-07-mon-event-failures-1 branch from 86a58b6 to 87ae44d on August 12, 2025 at 20:48
@ldk-reviews-bot

✅ Added second reviewer: @joostjager

{
walk_htlcs!(
false,
us.funding.counterparty_claimable_outpoints.get(&txid).unwrap().iter().filter_map(
Contributor

Potential panic due to unwrap() on None value. The code calls us.funding.counterparty_claimable_outpoints.get(&txid).unwrap() but there's no guarantee that the txid exists in the map. If the txid is not found, this will panic at runtime. Should use if let Some(outpoints) = us.funding.counterparty_claimable_outpoints.get(&txid) or similar safe pattern instead of unwrap().

Suggested change
- us.funding.counterparty_claimable_outpoints.get(&txid).unwrap().iter().filter_map(
+ us.funding.counterparty_claimable_outpoints.get(&txid).unwrap_or(&Vec::new()).iter().filter_map(
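
As a side note, a standalone sketch (generic Rust, not the monitor code itself) of two patterns that avoid the `unwrap()` without constructing a temporary `Vec`:

```rust
use std::collections::HashMap;

fn main() {
    let claimable: HashMap<u32, Vec<&str>> =
        HashMap::from([(1, vec!["htlc_a", "htlc_b"])]);

    // Option 1: fall back to an empty slice when the key is missing.
    let missing: &[&str] = claimable.get(&2).map(|v| v.as_slice()).unwrap_or(&[]);
    assert_eq!(missing.len(), 0);

    // Option 2: flatten the Option straight into the iterator chain.
    let present = claimable.get(&1).into_iter().flatten().count();
    assert_eq!(present, 2);
}
```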

Spotted by Diamond

"Failing HTLC with payment hash {} as it was resolved on-chain.",
htlc.payment_hash
);
failed_htlcs.push((
Contributor

Will we generate a PaymentFailed and/or HTLCHandlingFailed event for these HTLCs on every restart until the monitor is removed? Could be worth noting in a comment if so

Collaborator Author

That's fixed in #3984 :)

Contributor

@valentinewallace left a comment

LGTM after squash

}

let txid = confirmed_txid.unwrap();
if Some(txid) == us.funding.current_counterparty_commitment_txid
Contributor

Doesn't this need to consider the current/prev txids for the pending_funding scopes as well if we have alternative_funding_confirmed set?

Collaborator Author

It does now - when this was written, that PR hadn't landed yet :)

@TheBlueMatt force-pushed the 2025-07-mon-event-failures-1 branch from 842a018 to 8ae6661 on August 15, 2025 at 23:24

`test_dup_htlc_onchain_doesnt_fail_on_reload` made reference to
`ChainMonitor` persisting `ChannelMonitor`s on each new block,
which hasn't been the case in some time. Instead, we update the
comment and code to make explicit that it doesn't impact the test.

During testing, we check that a `ChannelMonitor` will round-trip
through serialization exactly. However, we recently added a fix to
change a value in `PackageTemplate` on reload to fix some issues in
the field in 0.1. This can cause the round-trip tests to fail as a
field is modified during read.

We fix it here by simply exempting the field from the equality test
in the condition where it would be updated on read.

We also make the `ChannelMonitor` `PartialEq` trait implementation
non-public as weird workarounds like this make clear that such a
comparison is a brittle API at best.
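
As a rough illustration of the exemption approach (hypothetical test-only types; the field name below is made up and is not the actual `PackageTemplate` field), the round-trip check can normalize the one field the read path is allowed to rewrite before comparing:

```rust
#[derive(Clone, Debug, PartialEq)]
struct PackageLike {
    // Stand-in for the one field the read path is allowed to rewrite.
    rewritten_on_read: u32,
    other_state: Vec<u8>,
}

fn assert_round_trip_eq(before: &PackageLike, after_read: &PackageLike) {
    // Copy the exempted field over on a clone so everything else still has
    // to match exactly.
    let mut normalized = after_read.clone();
    normalized.rewritten_on_read = before.rewritten_on_read;
    assert_eq!(*before, normalized);
}

fn main() {
    let before = PackageLike { rewritten_on_read: 100, other_state: vec![1, 2, 3] };
    // Simulate deserialization bumping the exempted field.
    let after_read = PackageLike { rewritten_on_read: 105, other_state: vec![1, 2, 3] };
    assert_round_trip_eq(&before, &after_read);
}
```
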
On `ChannelManager` reload we rebuild the pending outbound payments
list by looking for any missing payments in `ChannelMonitor`s.
However, in the same loop over `ChannelMonitor`s, we also re-claim
any pending payments which we see we have a payment preimage for.

If we send an MPP payment across different channels, the result may
be that we'll iterate the loop, and in each iteration add a
pending payment with only one known path, then claim/fail it and
remove the pending payment (at least for the claim case). This may
result in spurious extra events, or even both a `PaymentFailed` and
`PaymentSent` event on startup for the same payment.
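
One way to avoid this kind of interleaving, sketched with hypothetical names and types (not necessarily how this PR structures it), is to rebuild every known part first and only then resolve:

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Event {
    PaymentSent,
    PaymentFailed,
}

fn main() {
    // (channel, part_claimed): one part failed on-chain, one was claimed.
    let monitor_outcomes = [("chan_a", false), ("chan_b", true)];

    // Pass 1: rebuild each pending payment with all of its known parts.
    let mut parts: HashMap<&str, Vec<bool>> = HashMap::new();
    for (_chan, part_claimed) in monitor_outcomes {
        parts.entry("payment_id").or_default().push(part_claimed);
    }

    // Pass 2: resolve each payment exactly once, with full knowledge of its
    // parts. Any claimed part means we hold the preimage, so it was sent.
    let mut events = Vec::new();
    for (_payment_id, outcomes) in &parts {
        if outcomes.iter().any(|claimed| *claimed) {
            events.push(Event::PaymentSent);
        } else {
            events.push(Event::PaymentFailed);
        }
    }

    assert_eq!(events, vec![Event::PaymentSent]);
}
```
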
`MonitorEvent`s aren't delivered to the `ChannelManager` in a
durable fashion - if the `ChannelManager` fetches the pending
`MonitorEvent`s, then the `ChannelMonitor` gets persisted (i.e. due
to a block update) then the node crashes, prior to persisting the
`ChannelManager` again, the `MonitorEvent` and its effects on the
`ChannelManager` will be lost. This isn't likely in a sync persist
environment, but in an async one this could be an issue.

Note that this is only an issue for closed channels -
`MonitorEvent`s only inform the `ChannelManager` that a channel is
closed (which the `ChannelManager` will learn on startup or when it
next tries to advance the channel state), that
`ChannelMonitorUpdate` writes completed (which the `ChannelManager`
will detect on startup), or that HTLCs resolved on-chain post
closure. Of the three, only the last is problematic to lose prior
to a reload.

When we restart and, during `ChannelManager` load, see a
`ChannelMonitor` for a closed channel, we scan it for preimages
that we passed to it and re-apply those to any pending or forwarded
payments. However, we didn't scan it for preimages it learned from
transactions on-chain. In cases where a `MonitorEvent` is lost,
this can lead to a lost preimage. Here we fix it by simply tracking
preimages we learned on-chain the same way we track preimages
picked up during normal channel operation.
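
A rough sketch of that idea with hypothetical types (not the real `ChannelMonitor` API): preimages extracted from on-chain claims land in the same map as preimages provided during normal operation, so the startup replay path sees both.

```rust
use std::collections::HashMap;

type PaymentHash = [u8; 32];
type PaymentPreimage = [u8; 32];

#[derive(Default)]
struct MonitorLike {
    // The single map consulted when replaying preimages on startup.
    payment_preimages: HashMap<PaymentHash, PaymentPreimage>,
}

impl MonitorLike {
    // Normal operation: the ChannelManager hands the monitor a preimage.
    fn provide_payment_preimage(&mut self, hash: PaymentHash, preimage: PaymentPreimage) {
        self.payment_preimages.insert(hash, preimage);
    }

    // On-chain claim observed: record the extracted preimage in the same map,
    // so a lost MonitorEvent no longer means a lost preimage.
    fn preimage_learned_on_chain(&mut self, hash: PaymentHash, preimage: PaymentPreimage) {
        self.payment_preimages.insert(hash, preimage);
    }

    // Startup replay: everything we know, however we learned it.
    fn preimages_to_replay(&self) -> impl Iterator<Item = (&PaymentHash, &PaymentPreimage)> {
        self.payment_preimages.iter()
    }
}

fn main() {
    let mut mon = MonitorLike::default();
    mon.provide_payment_preimage([1; 32], [2; 32]);
    mon.preimage_learned_on_chain([3; 32], [4; 32]);
    assert_eq!(mon.preimages_to_replay().count(), 2);
}
```
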
`MonitorEvent`s aren't delivered to the `ChannelManager` in a
durable fashion - if the `ChannelManager` fetches the pending
`MonitorEvent`s, then the `ChannelMonitor` gets persisted (i.e. due
to a block update) then the node crashes, prior to persisting the
`ChannelManager` again, the `MonitorEvent` and its effects on the
`ChannelManager` will be lost. This isn't likely in a sync persist
environment, but in an async one this could be an issue.

Note that this is only an issue for closed channels -
`MonitorEvent`s only inform the `ChannelManager` that a channel is
closed (which the `ChannelManager` will learn on startup or when it
next tries to advance the channel state), that
`ChannelMonitorUpdate` writes completed (which the `ChannelManager`
will detect on startup), or that HTLCs resolved on-chain post
closure. Of the three, only the last is problematic to lose prior
to a reload.

In a previous commit we handled the case of claimed HTLCs by
replaying payment preimages on startup to avoid `MonitorEvent` loss
causing us to miss an HTLC claim. Here we handle the HTLC-failed
case similarly.

Unlike with HTLC claims via preimage, we don't already have replay
logic in `ChannelManager` startup, but it's easy enough to add one.
Luckily, we already track when an HTLC reaches permanently-failed
state in `ChannelMonitor` (i.e. it has `ANTI_REORG_DELAY`
confirmations on-chain on the failing transaction), so all we need
to do is add the ability to query for that and fail them on
`ChannelManager` startup.
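
A sketch of that query-and-replay shape with hypothetical types and method names (the real LDK types differ; the 6-confirmation value only stands in for `ANTI_REORG_DELAY` here):

```rust
/// Assumed stand-in for the anti-reorg confirmation depth.
const ANTI_REORG_DELAY: u32 = 6;

struct OnchainHtlcResolution {
    payment_hash: [u8; 32],
    failing_tx_confirmations: u32,
    resolved_with_preimage: bool,
}

struct MonitorLike {
    resolutions: Vec<OnchainHtlcResolution>,
}

impl MonitorLike {
    // Only HTLCs whose failing transaction is buried deeply enough count as
    // permanently failed and are safe to fail back on startup.
    fn permanently_failed_htlcs(&self) -> impl Iterator<Item = &OnchainHtlcResolution> {
        self.resolutions.iter().filter(|r| {
            !r.resolved_with_preimage && r.failing_tx_confirmations >= ANTI_REORG_DELAY
        })
    }
}

fn main() {
    let mon = MonitorLike {
        resolutions: vec![
            OnchainHtlcResolution { payment_hash: [1; 32], failing_tx_confirmations: 7, resolved_with_preimage: false },
            OnchainHtlcResolution { payment_hash: [2; 32], failing_tx_confirmations: 2, resolved_with_preimage: false },
            OnchainHtlcResolution { payment_hash: [3; 32], failing_tx_confirmations: 9, resolved_with_preimage: true },
        ],
    };

    // On ChannelManager startup, only the first HTLC would be failed back.
    let to_fail: Vec<_> = mon.permanently_failed_htlcs().map(|r| r.payment_hash).collect();
    let expected: Vec<[u8; 32]> = vec![[1u8; 32]];
    assert_eq!(to_fail, expected);
}
```
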
@TheBlueMatt force-pushed the 2025-07-mon-event-failures-1 branch from 8ae6661 to 69b281b on August 15, 2025 at 23:26
@TheBlueMatt
Collaborator Author

Rebased to update against the new funding logic in ChannelMonitor.

Labels
backport 0.1, weekly goal (Someone wants to land this this week)

4 participants