Conversation

Contributor

@pwltr pwltr commented Oct 29, 2025

Similar to the (reverted) #286, but this only changes the logic (writing locally first) for new channels, to cover channels opened with a balance pushed to us.

Fix: Critical Channel Monitor Persistence Bug

🚨 Critical Issue Fixed

This PR resolves a critical bug that could lead to unrecoverable fund loss when a new channel was created and the remote backup failed or the app was killed during the backup operation. LDK normally ensures that the channel monitor is durably persisted before it will ever accept an HTLC on that channel, but it does not do so on the channel open itself, so any balance pushed to us at channel open is not secured. This PR adds extra logic to handle this case.

🐛 Problem

  • For new channels: Channel monitor persistence depended on remote backup success
  • If the remote backup failed or the app was killed during backup, local persistence was skipped for new channels
  • Multiple failure scenarios could cause fund loss:
    • App killed during backup retries (before callback)
    • No network connection (permanent backup failure)
    • Network issues during backup
    • Backup server down
  • When the channel monitor was lost, channels would force-close with "unknown channel" errors
  • Result: Permanent loss of channel state and unrecoverable funds for new channels

✅ Solution

  • For new channels: Persist locally first (synchronously), then attempt remote backup asynchronously
    • Eliminates the race condition where killing the app during backup would cause data loss
    • Ensures new channels are always protected immediately
  • For channel updates: Keep the remote-first approach (original behavior)
    • Local persistence happens after successful remote backup
  • New channels now survive all network failure scenarios and app kill scenarios

🔧 Technical Changes

  • New channels: Write to local disk synchronously before queueing remote backup
  • Channel updates: Keep original remote-first flow (try remote, then local on success)
  • Return ChannelMonitorUpdateStatus.Completed immediately for new channels (after local write)
  • Return ChannelMonitorUpdateStatus.InProgress for updates (wait for remote backup)
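
A rough sketch of the resulting decision logic (illustrative only, not the actual implementation in LdkPersister: MonitorPersistStatus, backupRemotely, and notifyChainMonitor are stand-ins for LDK's ChannelMonitorUpdateStatus, the remote backup client, and the chainMonitor.channelMonitorUpdated() call):

```swift
import Foundation

/// Illustrative stand-in for LDK's ChannelMonitorUpdateStatus.
enum MonitorPersistStatus {
    case completed      // local copy is safe, LDK can proceed
    case inProgress     // waiting for the async remote backup to confirm
    case unrecoverable  // local write failed, nothing is safe
}

/// Sketch of the persist flow described above. `backupRemotely` and
/// `notifyChainMonitor` are hypothetical hooks for the remote backup client
/// and the chain monitor notification.
func persistMonitor(
    serialized: Data,
    localPath: URL,
    isNew: Bool,
    backupQueue: DispatchQueue,
    backupRemotely: @escaping (Data) -> Bool,
    notifyChainMonitor: @escaping () -> Void
) -> MonitorPersistStatus {
    if isNew {
        // New channel: write to disk synchronously so any balance pushed to us
        // is secured even if the app is killed or the backup server is unreachable.
        do {
            try serialized.write(to: localPath, options: .atomic)
        } catch {
            return .unrecoverable
        }
        // Remote backup becomes best-effort and asynchronous from here on.
        backupQueue.async { _ = backupRemotely(serialized) }
        return .completed
    }

    // Channel update: keep the original remote-first flow; persist locally and
    // notify the chain monitor only once the remote backup has succeeded.
    backupQueue.async {
        if backupRemotely(serialized) {
            try? serialized.write(to: localPath, options: .atomic)
            DispatchQueue.main.async { notifyChainMonitor() }
        }
    }
    return .inProgress
}
```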

Contributor

@Jasonvdb Jasonvdb left a comment

Logic looks good to me. You shouldn't have to run channelMonitorUpdated (or anything else) on the main/UI thread in Swift or Kotlin though. I don't think it does any harm, but it's also not needed.

```swift
if isNew {
    try serialized.write(to: channelStoragePath)

    // Notify chain monitor on main thread
```
Contributor

@Jasonvdb Jasonvdb Oct 29, 2025


Shouldn't need to call channelMonitorUpdated on the main thread. Not sure it matters if you do, but it shouldn't be required.

Contributor


If the main thread is needed, then do the same below for the other call to channelMonitorUpdated.

Contributor Author

@pwltr pwltr Oct 29, 2025


After another test, the threading wrapper is indeed necessary. It only works coincidentally in the persistQueue because the timing happens not to conflict with LDK's read-write lock, I guess. cc @ovitrif

Short write-up by the agent:

The channelMonitorUpdated() method must be called on the main thread due to LDK's internal thread safety requirements.

The crash occurred in std::sys_common::rwlock::MovableRwLock::read when the method was called from a background thread. LDK's ChainMonitor uses internal read-write locks that are not safe to access from background threads.

Call Stack:

Thread 13 (crashed):
├── LdkPersister.handleChannel() 
├── channelMonitorUpdated() 
├── LDK internal ChainMonitor code
└── Read-write lock panic
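
A minimal sketch of the wrapper (names are illustrative; in LdkPersister the closure body is the chainMonitor.channelMonitorUpdated(...) call):

```swift
import Foundation

// Hop onto the main thread before signalling LDK that the monitor persist
// completed. `notifyMonitorUpdated` is a stand-in for the actual
// chainMonitor.channelMonitorUpdated(...) call.
func completePersist(notifyMonitorUpdated: @escaping () -> Void) {
    DispatchQueue.main.async {
        notifyMonitorUpdated()
    }
}
```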

Contributor Author


I updated the other call to channelMonitorUpdated() to also run on the main thread, as suggested.

ovitrif previously approved these changes Oct 29, 2025
Contributor

@ovitrif ovitrif left a comment


utACK, looks good, with one nit about not having to run the update on the main thread, same as Jay's remark ^^

Contributor

@ovitrif ovitrif left a comment


ACK, nice way to answer our threading remarks: nothing beats testing 🙌🏻!

@pwltr pwltr merged commit efc40ef into master Oct 30, 2025
1 of 6 checks passed
@ovitrif ovitrif deleted the fix/new-channel-backup branch October 31, 2025 12:17