Replay lost MonitorEvents in some cases for closed channels #4004


Open
TheBlueMatt wants to merge 13 commits into main from 2025-07-mon-event-failures-1

Conversation

TheBlueMatt
Collaborator

This is the first few commits from #3984 which handles some of the trivial cases, but skips actually marking MonitorEvents completed which is needed for some further edge-cases.

@TheBlueMatt added the backport 0.1 and weekly goal (Someone wants to land this this week) labels on Aug 11, 2025
@ldk-reviews-bot

ldk-reviews-bot commented Aug 11, 2025

👋 Thanks for assigning @valentinewallace as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@TheBlueMatt force-pushed the 2025-07-mon-event-failures-1 branch from 658c408 to dc20515 on August 11, 2025 at 20:18
@valentinewallace
Contributor

One of the new tests is failing

@TheBlueMatt force-pushed the 2025-07-mon-event-failures-1 branch from dc20515 to e076dd5 on August 12, 2025 at 01:46
@TheBlueMatt
Collaborator Author

Oops, silent rebase conflict due to #4001.

@valentinewallace
Contributor

Test is still failing

@TheBlueMatt force-pushed the 2025-07-mon-event-failures-1 branch from e076dd5 to 86a58b6 on August 12, 2025 at 14:46
@TheBlueMatt
Collaborator Author

Grr, the #4001 changes were random, so I had to fix all the paths; should be fine now.


codecov bot commented Aug 12, 2025

Codecov Report

❌ Patch coverage is 93.43434% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.86%. Comparing base (ac8f897) to head (69b281b).

Files with missing lines                 Patch %   Lines
lightning/src/ln/monitor_tests.rs        95.14%    6 Missing and 6 partials ⚠️
lightning/src/chain/package.rs           70.37%    8 Missing ⚠️
lightning/src/ln/channelmanager.rs       88.23%    3 Missing and 1 partial ⚠️
lightning/src/chain/channelmonitor.rs    97.70%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4004      +/-   ##
==========================================
+ Coverage   88.85%   88.86%   +0.01%     
==========================================
  Files         175      175              
  Lines      127710   128093     +383     
  Branches   127710   128093     +383     
==========================================
+ Hits       113478   113835     +357     
- Misses      11675    11692      +17     
- Partials     2557     2566       +9     
Flag Coverage Δ
fuzzing 21.82% <0.00%> (-0.05%) ⬇️
tests 88.70% <93.43%> (+0.01%) ⬆️


Comment on lines +16536 to +16444
for (channel_id, monitor) in args.channel_monitors.iter() {
let mut is_channel_closed = false;
let counterparty_node_id = monitor.get_counterparty_node_id();
if let Some(peer_state_mtx) = per_peer_state.get(&counterparty_node_id) {
let mut peer_state_lock = peer_state_mtx.lock().unwrap();
let peer_state = &mut *peer_state_lock;
is_channel_closed = !peer_state.channel_by_id.contains_key(channel_id);
}
Contributor

I'm not seeing the bug that this is fixing -- if a payment is failed after being fulfilled in outbound_payments, we'll terminate early after noticing it's fulfilled already in OutboundPayments::fail_htlc

Context from the commit message: "... we could get both a PaymentFailed and a PaymentClaimed event on startup for the same payment"

Collaborator Author

Let's say we send an MPP payment along two different channels. One is claimed and the other failed (who knows why, the recipient just decided they don't like money for whatever reason).

Going through the single loop we may first find the failed-htlc channel - we'll add the pending payment in insert_from_monitor_on_startup which will just add the one part, then we'll see it failed and generate a PaymentFailed event since there's only one part. Then we'll go to the second channel and repeat the same process, but now with a PaymentSent event.
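
To make the failure mode concrete, here is a minimal self-contained sketch (hypothetical names and types, not LDK's code or this PR's) of how resolving each monitor's part in the same loop that rebuilds the payment can emit both terminal events for one MPP payment:

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Event {
    PaymentFailed,
    PaymentSent,
}

fn main() {
    // Two monitors, each knowing one part of the same MPP payment: the part on
    // chan_a failed on-chain, the part on chan_b was claimed with the preimage.
    let monitor_outcomes = [("chan_a", false), ("chan_b", true)];

    let mut pending: HashSet<&str> = HashSet::new();
    let mut events = Vec::new();

    for (_chan, part_claimed) in monitor_outcomes {
        // Analogous to insert_from_monitor_on_startup: only this monitor's
        // part is known, so the payment looks like a single-part payment.
        pending.insert("payment_id");

        // Resolving the only known part "completes" the payment, so a
        // terminal event fires and the pending entry is dropped before the
        // next monitor re-adds it.
        events.push(if part_claimed { Event::PaymentSent } else { Event::PaymentFailed });
        pending.remove("payment_id");
    }

    // The same payment ends up with both a failure and a success event.
    assert_eq!(events, vec![Event::PaymentFailed, Event::PaymentSent]);
}
```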

Comment on lines +16536 to +16444
for (channel_id, monitor) in args.channel_monitors.iter() {
let mut is_channel_closed = false;
let counterparty_node_id = monitor.get_counterparty_node_id();
if let Some(peer_state_mtx) = per_peer_state.get(&counterparty_node_id) {
let mut peer_state_lock = peer_state_mtx.lock().unwrap();
let peer_state = &mut *peer_state_lock;
is_channel_closed = !peer_state.channel_by_id.contains_key(channel_id);
}
Contributor

Would appreciate some more context on this part of the commit message: "this can lead to a pending payment getting re-added and re-claimed multiple times" (which is phrased as a bad thing).

It looks like in the prior code and this new code, we'll call OutboundPayments::insert_from_monitor_on_startup for each session_priv (seems fine), and then call claim_htlc for each session_priv (seems fine since this will only generate a PaymentSent event on the first call to claim_htlc). Can you help me understand what I'm missing that was buggy in the prior code?

Collaborator Author

I rewrote the commit message.

($holder_commitment: expr, $htlc_iter: expr) => {
for (htlc, source) in $htlc_iter {
let filter = |v: &&IrrevocablyResolvedHTLC| {
v.commitment_tx_output_idx == htlc.transaction_output_index
Contributor

The output index can be None on both sides currently, is that intentional?

Collaborator Author

Uhhh, it wasn't, but I believe it is almost the correct behavior - if we have a dust HTLC that never made it to chain and the commitment is confirmed, we want to consider it resolved. I'll tweak it to make it clearer and more likely correct.
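
For illustration, a standalone sketch (simplified, hypothetical types; the `None` entry on the resolved side is an assumption made only for this example) of how a plain equality check on optional output indices ends up treating a dust HTLC, which has no on-chain output, as resolved:

```rust
// Simplified stand-ins for the real types.
#[derive(Clone, Copy)]
struct ResolvedHtlc {
    commitment_tx_output_idx: Option<u32>,
}

#[derive(Clone, Copy)]
struct Htlc {
    transaction_output_index: Option<u32>,
}

fn is_resolved(resolved: &[ResolvedHtlc], htlc: &Htlc) -> bool {
    resolved
        .iter()
        .any(|r| r.commitment_tx_output_idx == htlc.transaction_output_index)
}

fn main() {
    let resolved = [
        // A normal HTLC resolved at commitment output 1.
        ResolvedHtlc { commitment_tx_output_idx: Some(1) },
        // Hypothetical entry recorded when the confirmed commitment resolved
        // HTLCs that never had their own output.
        ResolvedHtlc { commitment_tx_output_idx: None },
    ];

    // Non-dust HTLC: matched by its output index.
    assert!(is_resolved(&resolved, &Htlc { transaction_output_index: Some(1) }));

    // Dust HTLC with no output: `None == None` also matches, i.e. it is
    // treated as resolved once the commitment confirms, which is roughly the
    // "almost correct" behavior described above.
    assert!(is_resolved(&resolved, &Htlc { transaction_output_index: None }));
}
```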

};
if let Some(state) = us.htlcs_resolved_on_chain.iter().filter(filter).next() {
if let Some(source) = source {
if state.payment_preimage.is_none() {
Contributor

Tests pass when removing this check atm. Probably fine if the test is too annoying to write though.

Collaborator Author

I believe it's unobservable. There are several cases where we'd hit a theoretical else block, but the ChannelManager read logic replays preimages first, so we'd always replay the preimage, then get here, spuriously include it, and not have anything to fail.

@ldk-reviews-bot

👋 The first review has been submitted!

Do you think this PR is ready for a second reviewer? If so, click here to assign a second reviewer.

@TheBlueMatt force-pushed the 2025-07-mon-event-failures-1 branch from 86a58b6 to 87ae44d on August 12, 2025 at 20:48
@ldk-reviews-bot

✅ Added second reviewer: @joostjager

{
walk_htlcs!(
false,
us.funding.counterparty_claimable_outpoints.get(&txid).unwrap().iter().filter_map(
Contributor

Potential panic due to unwrap() on None value. The code calls us.funding.counterparty_claimable_outpoints.get(&txid).unwrap() but there's no guarantee that the txid exists in the map. If the txid is not found, this will panic at runtime. Should use if let Some(outpoints) = us.funding.counterparty_claimable_outpoints.get(&txid) or similar safe pattern instead of unwrap().

Suggested change
- us.funding.counterparty_claimable_outpoints.get(&txid).unwrap().iter().filter_map(
+ us.funding.counterparty_claimable_outpoints.get(&txid).unwrap_or(&Vec::new()).iter().filter_map(
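
As a side note, a standalone sketch (generic Rust, not the monitor code itself) of two patterns that avoid the `unwrap()` without constructing a temporary `Vec`:

```rust
use std::collections::HashMap;

fn main() {
    let claimable: HashMap<u32, Vec<&str>> =
        HashMap::from([(1, vec!["htlc_a", "htlc_b"])]);

    // Option 1: fall back to an empty slice when the key is missing.
    let missing: &[&str] = claimable.get(&2).map(|v| v.as_slice()).unwrap_or(&[]);
    assert_eq!(missing.len(), 0);

    // Option 2: flatten the Option straight into the iterator chain.
    let present = claimable.get(&1).into_iter().flatten().count();
    assert_eq!(present, 2);
}
```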

Spotted by Diamond

"Failing HTLC with payment hash {} as it was resolved on-chain.",
htlc.payment_hash
);
failed_htlcs.push((
Contributor

Will we generate a PaymentFailed and/or HTLCHandlingFailed event for these HTLCs on every restart until the monitor is removed? Could be worth noting in a comment if so

Collaborator Author

That's fixed in #3984 :)

Contributor

@valentinewallace left a comment

LGTM after squash

}

let txid = confirmed_txid.unwrap();
if Some(txid) == us.funding.current_counterparty_commitment_txid
Contributor

Doesn't this need to consider the current/prev txids for the pending_funding scopes as well if we have alternative_funding_confirmed set?

Collaborator Author

It does now - when this was written, that PR hadn't landed yet :)

@TheBlueMatt force-pushed the 2025-07-mon-event-failures-1 branch from 842a018 to 8ae6661 on August 15, 2025 at 23:24

`test_dup_htlc_onchain_doesnt_fail_on_reload` made reference to
`ChainMonitor` persisting `ChannelMonitor`s on each new block,
which hasn't been the case in some time. Instead, we update the
comment and code to make explicit that it doesn't impact the test.

During testing, we check that a `ChannelMonitor` will round-trip
through serialization exactly. However, we recently added a fix to
change a value in `PackageTemplate` on reload to fix some issues in
the field in 0.1. This can cause the round-trip tests to fail as a
field is modified during read.

We fix it here by simply exempting the field from the equality test
in the condition where it would be updated on read.

We also make the `ChannelMonitor` `PartialEq` trait implementation
non-public as weird workarounds like this make clear that such a
comparison is a brittle API at best.
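
As a rough illustration of the exemption approach (hypothetical test-only types; the field name below is made up and is not the actual `PackageTemplate` field), the round-trip check can normalize the one field the read path is allowed to rewrite before comparing:

```rust
#[derive(Clone, Debug, PartialEq)]
struct PackageLike {
    // Stand-in for the one field the read path is allowed to rewrite.
    rewritten_on_read: u32,
    other_state: Vec<u8>,
}

fn assert_round_trip_eq(before: &PackageLike, after_read: &PackageLike) {
    // Copy the exempted field over on a clone so everything else still has
    // to match exactly.
    let mut normalized = after_read.clone();
    normalized.rewritten_on_read = before.rewritten_on_read;
    assert_eq!(*before, normalized);
}

fn main() {
    let before = PackageLike { rewritten_on_read: 100, other_state: vec![1, 2, 3] };
    // Simulate deserialization bumping the exempted field.
    let after_read = PackageLike { rewritten_on_read: 105, other_state: vec![1, 2, 3] };
    assert_round_trip_eq(&before, &after_read);
}
```
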
On `ChannelManager` reload we rebuild the pending outbound payments
list by looking for any missing payments in `ChannelMonitor`s.
However, in the same loop over `ChannelMonitor`s, we also re-claim
any pending payments which we see we have a payment preimage for.

If we send an MPP payment across different channels, the result may
be that we'll iterate the loop, and in each iteration add a
pending payment with only one known path, then claim/fail it and
remove the pending payment (at least for the claim case). This may
result in spurious extra events, or even both a `PaymentFailed` and
`PaymentSent` event on startup for the same payment.
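
One way to avoid this kind of interleaving, sketched with hypothetical names and types (not necessarily how this PR structures it), is to rebuild every known part first and only then resolve:

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Event {
    PaymentSent,
    PaymentFailed,
}

fn main() {
    // (channel, part_claimed): one part failed on-chain, one was claimed.
    let monitor_outcomes = [("chan_a", false), ("chan_b", true)];

    // Pass 1: rebuild each pending payment with all of its known parts.
    let mut parts: HashMap<&str, Vec<bool>> = HashMap::new();
    for (_chan, part_claimed) in monitor_outcomes {
        parts.entry("payment_id").or_default().push(part_claimed);
    }

    // Pass 2: resolve each payment exactly once, with full knowledge of its
    // parts. Any claimed part means we hold the preimage, so it was sent.
    let mut events = Vec::new();
    for (_payment_id, outcomes) in &parts {
        if outcomes.iter().any(|claimed| *claimed) {
            events.push(Event::PaymentSent);
        } else {
            events.push(Event::PaymentFailed);
        }
    }

    assert_eq!(events, vec![Event::PaymentSent]);
}
```
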
`MonitorEvent`s aren't delivered to the `ChannelManager` in a
durable fashion - if the `ChannelManager` fetches the pending
`MonitorEvent`s, then the `ChannelMonitor` gets persisted (i.e. due
to a block update) then the node crashes, prior to persisting the
`ChannelManager` again, the `MonitorEvent` and its effects on the
`ChannelManager` will be lost. This isn't likely in a sync persist
environment, but in an async one this could be an issue.

Note that this is only an issue for closed channels -
`MonitorEvent`s only inform the `ChannelManager` that a channel is
closed (which the `ChannelManager` will learn on startup or when it
next tries to advance the channel state), that
`ChannelMonitorUpdate` writes completed (which the `ChannelManager`
will detect on startup), or that HTLCs resolved on-chain post
closure. Of the three, only the last is problematic to lose prior
to a reload.

When we restart and, during `ChannelManager` load, see a
`ChannelMonitor` for a closed channel, we scan it for preimages
that we passed to it and re-apply those to any pending or forwarded
payments. However, we didn't scan it for preimages it learned from
transactions on-chain. In cases where a `MonitorEvent` is lost,
this can lead to a lost preimage. Here we fix it by simply tracking
preimages we learned on-chain the same way we track preimages
picked up during normal channel operation.
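
A rough sketch of that idea with hypothetical types (not the real `ChannelMonitor` API): preimages extracted from on-chain claims land in the same map as preimages provided during normal operation, so the startup replay path sees both.

```rust
use std::collections::HashMap;

type PaymentHash = [u8; 32];
type PaymentPreimage = [u8; 32];

#[derive(Default)]
struct MonitorLike {
    // The single map consulted when replaying preimages on startup.
    payment_preimages: HashMap<PaymentHash, PaymentPreimage>,
}

impl MonitorLike {
    // Normal operation: the ChannelManager hands the monitor a preimage.
    fn provide_payment_preimage(&mut self, hash: PaymentHash, preimage: PaymentPreimage) {
        self.payment_preimages.insert(hash, preimage);
    }

    // On-chain claim observed: record the extracted preimage in the same map,
    // so a lost MonitorEvent no longer means a lost preimage.
    fn preimage_learned_on_chain(&mut self, hash: PaymentHash, preimage: PaymentPreimage) {
        self.payment_preimages.insert(hash, preimage);
    }

    // Startup replay: everything we know, however we learned it.
    fn preimages_to_replay(&self) -> impl Iterator<Item = (&PaymentHash, &PaymentPreimage)> {
        self.payment_preimages.iter()
    }
}

fn main() {
    let mut mon = MonitorLike::default();
    mon.provide_payment_preimage([1; 32], [2; 32]);
    mon.preimage_learned_on_chain([3; 32], [4; 32]);
    assert_eq!(mon.preimages_to_replay().count(), 2);
}
```
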
`MonitorEvent`s aren't delivered to the `ChannelManager` in a
durable fashion - if the `ChannelManager` fetches the pending
`MonitorEvent`s, then the `ChannelMonitor` gets persisted (i.e. due
to a block update) then the node crashes, prior to persisting the
`ChannelManager` again, the `MonitorEvent` and its effects on the
`ChannelManager` will be lost. This isn't likely in a sync persist
environment, but in an async one this could be an issue.

Note that this is only an issue for closed channels -
`MonitorEvent`s only inform the `ChannelManager` that a channel is
closed (which the `ChannelManager` will learn on startup or when it
next tries to advance the channel state), that
`ChannelMonitorUpdate` writes completed (which the `ChannelManager`
will detect on startup), or that HTLCs resolved on-chain post
closure. Of the three, only the last is problematic to lose prior
to a reload.

In a previous commit we handled the case of claimed HTLCs by
replaying payment preimages on startup to avoid `MonitorEvent` loss
causing us to miss an HTLC claim. Here we handle the HTLC-failed
case similarly.

Unlike with HTLC claims via preimage, we don't already have replay
logic in `ChannelManager` startup, but it's easy enough to add one.
Luckily, we already track when an HTLC reaches permanently-failed
state in `ChannelMonitor` (i.e. it has `ANTI_REORG_DELAY`
confirmations on-chain on the failing transaction), so all we need
to do is add the ability to query for that and fail them on
`ChannelManager` startup.
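
A sketch of that query-and-replay shape with hypothetical types and method names (the real LDK types differ; the 6-confirmation value only stands in for `ANTI_REORG_DELAY` here):

```rust
/// Assumed stand-in for the anti-reorg confirmation depth.
const ANTI_REORG_DELAY: u32 = 6;

struct OnchainHtlcResolution {
    payment_hash: [u8; 32],
    failing_tx_confirmations: u32,
    resolved_with_preimage: bool,
}

struct MonitorLike {
    resolutions: Vec<OnchainHtlcResolution>,
}

impl MonitorLike {
    // Only HTLCs whose failing transaction is buried deeply enough count as
    // permanently failed and are safe to fail back on startup.
    fn permanently_failed_htlcs(&self) -> impl Iterator<Item = &OnchainHtlcResolution> {
        self.resolutions.iter().filter(|r| {
            !r.resolved_with_preimage && r.failing_tx_confirmations >= ANTI_REORG_DELAY
        })
    }
}

fn main() {
    let mon = MonitorLike {
        resolutions: vec![
            OnchainHtlcResolution { payment_hash: [1; 32], failing_tx_confirmations: 7, resolved_with_preimage: false },
            OnchainHtlcResolution { payment_hash: [2; 32], failing_tx_confirmations: 2, resolved_with_preimage: false },
            OnchainHtlcResolution { payment_hash: [3; 32], failing_tx_confirmations: 9, resolved_with_preimage: true },
        ],
    };

    // On ChannelManager startup, only the first HTLC would be failed back.
    let to_fail: Vec<_> = mon.permanently_failed_htlcs().map(|r| r.payment_hash).collect();
    let expected: Vec<[u8; 32]> = vec![[1u8; 32]];
    assert_eq!(to_fail, expected);
}
```
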
@TheBlueMatt force-pushed the 2025-07-mon-event-failures-1 branch from 8ae6661 to 69b281b on August 15, 2025 at 23:26
@TheBlueMatt
Collaborator Author

Rebased to update against the new funding logic in ChannelMonitor.

Labels
backport 0.1, weekly goal (Someone wants to land this this week)

4 participants