Make event handling fallible #2995

tnull · 2024-04-15T09:12:07Z

Closes #2490,
Closes #2097.

Previously, we would require our users to handle all events successfully inline or panic while trying to do so. If they exited the EventHandler any other way, we'd forget about the event and wouldn't replay it after restart.

Here, we implement fallible event handling, allowing the user to return Err(()) which signals to our event providers they should abort event processing and replay any unhandled events later (i.e., in the next invocation).
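As a rough illustration of the abort-and-replay behavior described above, here is a minimal, self-contained sketch. `Event` and `EventProvider` are hypothetical stand-ins invented for this example and are not LDK's actual types; the only thing taken from the description is that the handler may return `Err(())` and that unhandled events stay queued for the next invocation.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for an event type (not LDK's real `Event` enum).
#[derive(Clone, Debug, PartialEq)]
pub enum Event {
    PaymentReceived(u64),
    ChannelClosed(u32),
}

/// Hypothetical event provider that owns a queue of pending events.
pub struct EventProvider {
    pending: VecDeque<Event>,
}

impl EventProvider {
    pub fn new(events: Vec<Event>) -> Self {
        Self { pending: events.into() }
    }

    pub fn pending_len(&self) -> usize {
        self.pending.len()
    }

    /// Hands events to `handler` in order. On the first `Err(())` processing
    /// is aborted; the failed event and everything queued behind it stay in
    /// the queue and are offered again on the next call.
    pub fn process_pending_events<F>(&mut self, mut handler: F) -> Result<(), ()>
    where
        F: FnMut(&Event) -> Result<(), ()>,
    {
        while let Some(event) = self.pending.front() {
            handler(event)?;
            // Only drop the event once the handler reported success.
            self.pending.pop_front();
        }
        Ok(())
    }
}
```

Calling `process_pending_events` a second time with a handler that succeeds would drain whatever the first call left behind.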

TODO:

Add test coverage for replay behavior on Err(()).

lightning-background-processor/src/lib.rs

lightning/src/onion_message/messenger.rs

TheBlueMatt · 2024-04-15T15:01:14Z

Previously, we would require our users to handle all events successfully inline or panic while trying to do so

I believe our recommendation was always to simply loop trying to handle the event until they succeed, which is basically what we're doing here for them :)

As to the code here, I think we should make it clearer in the interface that the event will be replayed, e.g., by making the error variant a unit struct called ReplayEvent or so. Further, I think we should set the wakeup flag immediately on any failed event-handle to force the BP to go around its loop again without any sleeping. Otherwise concept lgtm.
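A small sketch of what these two suggestions could look like together. The name `ReplayEvent` comes from the comment above; `WakeupFlag` and `handle_and_flag` are hypothetical helpers made up for this illustration and do not correspond to LDK's notifier types.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Unit-struct error: returning it says "please replay this event later".
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ReplayEvent;

/// Hypothetical wakeup flag a background loop could check before sleeping.
pub struct WakeupFlag(AtomicBool);

impl WakeupFlag {
    pub const fn new() -> Self {
        Self(AtomicBool::new(false))
    }

    /// Marks that the loop should go around again without sleeping.
    pub fn notify(&self) {
        self.0.store(true, Ordering::Release);
    }

    /// Returns whether the flag was set, clearing it in the process.
    pub fn take(&self) -> bool {
        self.0.swap(false, Ordering::AcqRel)
    }
}

/// Runs `handler` on one event and sets the wakeup flag immediately if the
/// handler asks for a replay.
pub fn handle_and_flag<F>(event: u32, handler: F, wakeup: &WakeupFlag) -> Result<(), ReplayEvent>
where
    F: FnOnce(u32) -> Result<(), ReplayEvent>,
{
    let res = handler(event);
    if res.is_err() {
        wakeup.notify();
    }
    res
}
```

Compared with `Err(())`, a named unit struct makes the consequence of returning the error visible at every call site.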

codecov-commenter · 2024-05-27T10:03:31Z

Codecov Report

Attention: Patch coverage is 63.91753% with 70 lines in your changes missing coverage. Please review.

Project coverage is 89.75%. Comparing base (0cfe55c) to head (e617a39).

| Files | Patch % | Lines |
|---|---|---|
| lightning/src/onion_message/messenger.rs | 58.06% | 21 Missing and 5 partials ⚠️ |
| lightning/src/util/async_poll.rs | 34.61% | 14 Missing and 3 partials ⚠️ |
| lightning-background-processor/src/lib.rs | 80.48% | 5 Missing and 11 partials ⚠️ |
| lightning/src/chain/chainmonitor.rs | 46.15% | 6 Missing and 1 partial ⚠️ |
| lightning/src/chain/channelmonitor.rs | 40.00% | 3 Missing ⚠️ |
| lightning/src/events/mod.rs | 50.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2995      +/-   ##
==========================================
- Coverage   89.79%   89.75%   -0.05%     
==========================================
  Files         121      121              
  Lines      100826   100916      +90     
  Branches   100826   100916      +90     
==========================================
+ Hits        90537    90576      +39     
- Misses       7614     7658      +44     
- Partials     2675     2682       +7


TheBlueMatt

See also my comments above.

lightning-background-processor/src/lib.rs

lightning/src/ln/channelmanager.rs

tnull · 2024-06-03T12:50:39Z

Rebased and included a fixup to introduce ReplayEvent unit struct error variant. But still ironing out some details regarding the approach.

Further, I think we should set the wakeup flag immediately on any failed event-handle to force the BP to go around its loop again without any sleeping.

Could you clarify which flag you are referring to exactly? Do you mean just `continue`ing the loop in case any of the event processors fails? If we do that, we should probably make sure it only happens once (and then maaaybe with an exponential back-off?), as otherwise the entire run loop might busy-wait if event handling keeps failing, e.g., due to a persistence failure?

TheBlueMatt · 2024-06-03T14:25:35Z

Could you clarify which flag you are referring to exactly?

The event_persist_notifier flag.

as otherwise the entire run loop might busy-wait

I think that's okay if the BP loop busy-waits? If we're blocked waiting for something to complete a busy-wait is a perfectly reasonable way to signal to the user that something is horribly wrong (maybe they'll file a bug report asking why their phone is getting hot) :)

tnull · 2024-06-03T15:42:30Z

The event_persist_notifier flag.

I'm confused: wouldn't this only work for ChannelManager's event handler, not the others, i.e., ChainMonitor, OnionMessenger?

I think that's okay if the BP loop busy-waits? If we're blocked waiting for something to complete a busy-wait is a perfectly reasonable way to signal to the user that something is horribly wrong (maybe they'll file a bug report asking why their phone is getting hot) :)

Hmm, I tend to disagree? That might be okay in blocking land where event handling would run on its own thread, but in async land this might steal a full working thread from the runtime, possibly leading to locking up the node entirely? So I'd prefer not to busy-wait without ever yielding in an async context?

TheBlueMatt · 2024-06-03T15:50:58Z

I'm confused: wouldn't this only work for ChannelManager's event handler, not the others, i.e., ChainMonitor, OnionMessenger?

Sure, they should all do a similar thing.

Hmm, I tend to disagree? That might be okay in blocking land where event handling would run on its own thread, but in async land this might steal a full working thread from the runtime, possibly leading to locking up the node entirely? So I'd prefer not to busy-wait without ever yielding in an async context?

Hmm, as long as the user has an async...anything handling the event that's failing we should yield at least once during the BP loop. We could add an explicit yield (in the form of a ~1ms sleep), I guess?
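To make the trade-off in this exchange concrete, here is a hedged sketch of a retry loop that pauses between failed rounds instead of spinning. It is a blocking, std-only variant (an async loop would await a timer instead of calling `thread::sleep`), and `backoff_delay`, `run_until_handled`, the 1 ms base and the 1 s cap are all assumptions made for the example, not values from the PR.

```rust
use std::time::Duration;

/// Delay before the next retry: 1 ms after the first failure, doubling on
/// each further consecutive failure, capped at one second.
pub fn backoff_delay(consecutive_failures: u32) -> Duration {
    const BASE_MS: u64 = 1;
    const MAX_MS: u64 = 1_000;
    if consecutive_failures == 0 {
        return Duration::ZERO;
    }
    // Clamp the shift so the multiplication can never overflow.
    let shift = (consecutive_failures - 1).min(10);
    Duration::from_millis((BASE_MS << shift).min(MAX_MS))
}

/// Calls `process_events` until it succeeds or `max_rounds` is exhausted,
/// sleeping between failed rounds so the loop never busy-waits.
/// Returns the round in which processing succeeded.
pub fn run_until_handled<F>(mut process_events: F, max_rounds: u32) -> Result<u32, ()>
where
    F: FnMut() -> Result<(), ()>,
{
    let mut failures = 0;
    for round in 1..=max_rounds {
        match process_events() {
            Ok(()) => return Ok(round),
            Err(()) => {
                failures += 1;
                std::thread::sleep(backoff_delay(failures));
            }
        }
    }
    Err(())
}
```

A fixed ~1 ms pause, as floated above, corresponds to always using the first step of this schedule.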

tnull · 2024-06-05T09:11:08Z

Rebased to resolve conflicts after #3060 landed, have yet to adjust MultiFuturePoller though.

tnull · 2024-07-02T13:31:56Z

Now added logic to retain failed events in the OnionMessenger event queues to have them replayed upon next invocation. To do this I adjusted MultiFuturePoller to collect and return the actual event handling results.

Two observations:

- It's a bit unfortunate that this requires us to clone the queues as we can only remove the events from the queues after they have been successfully processed (but we have the same issue in CM/CM).
- While we replay events, we of course still won't persist the OnionMessenger events in any way. While I understand this is a deliberate design choice for DoS protection, it seems also like an API contract violation as events might get lost if the node restarts/crashes between event generation and the user handling it.
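The clone-then-remove pattern from the first observation might look roughly like the sketch below. `Messenger` is a hypothetical type with a single string queue, not the real `OnionMessenger`, and the sketch assumes that other threads only ever append to the queue while handlers run.

```rust
use std::sync::Mutex;

/// Error a handler returns to ask for the event to be replayed.
pub struct ReplayEvent;

/// Hypothetical messenger with one in-memory (non-persisted) event queue.
pub struct Messenger {
    pending: Mutex<Vec<String>>,
}

impl Messenger {
    pub fn new(events: Vec<String>) -> Self {
        Self { pending: Mutex::new(events) }
    }

    pub fn pending(&self) -> Vec<String> {
        self.pending.lock().unwrap().clone()
    }

    pub fn process_pending_events<F>(&self, mut handler: F)
    where
        F: FnMut(&str) -> Result<(), ReplayEvent>,
    {
        // Clone a snapshot so the lock is not held while user code runs.
        let snapshot: Vec<String> = self.pending.lock().unwrap().clone();
        // Record, per event, whether the handler succeeded.
        let handled: Vec<bool> =
            snapshot.iter().map(|ev| handler(ev.as_str()).is_ok()).collect();

        // Remove only the events that were handled. Anything appended after
        // the snapshot sits past `handled.len()` and is kept.
        let mut queue = self.pending.lock().unwrap();
        let mut idx = 0;
        queue.retain(|_| {
            let done = handled.get(idx).copied().unwrap_or(false);
            idx += 1;
            !done
        });
    }
}
```

Because the queue lives only in memory here, a crash between snapshot and handling still loses the events, which is the concern raised in the second observation.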

The event_persist_notifier flag.

Coming back to this: Upon further inspection it seems we set result = NotifyOption::DoPersist in process_events_body whenever we have pending events anyways? So the event_persist_notifier should already get triggered?

Generally this is still missing some test coverage, but should be ready for another round of review apart from that.

TheBlueMatt · 2024-07-02T20:56:22Z

Coming back to this: Upon further inspection it seems we set result = NotifyOption::DoPersist in process_events_body whenever we have pending events anyways? So the event_persist_notifier should already get triggered?

Ah, indeed, you're right. We should, however, add something similar in chainmonitor.

lightning/src/chain/channelmonitor.rs

TheBlueMatt · 2024-07-02T21:02:18Z

lightning/src/events/mod.rs

+/// An error type that may be returned to LDK in order to safely abort event handling if it can't
+/// currently succeed (e.g., due to a persistence failure).
+///
+/// LDK will ensure the event is persisted and will eventually be replayed.


TheBlueMatt · 2024-07-02

We shouldn't be so cut-and-dry here because it's not really true - we may well (persist and) replay the event, but depending on the event type we may not. I'm not really sure how to accurately communicate this to users, though, at least short of documenting it for each individual event type (#2097).

tnull · 2024-07-03

Hmm, yeah, this comment predates the discussion we had offline. I think I'll see to pick up #2097 as part of this PR also.

tnull · 2024-07-08

Now took a stab at addressing #2097. Let me know if you think we should add more fine-grained context to some variants.

TheBlueMatt · 2024-07-08

Thanks! It looks great, though we also really need to add some additional context around events that are "robust" vs not - e.g., you could open a channel, have it closed and restart without ever persisting, implying a "lost" DiscardFunding. Doesn't have to happen in this PR, though it is an implied part of #2097.

tnull · 2024-07-09

I'm a bit confused by the DiscardFunding example, as it seems to me it would indeed be persisted across restart if it was ever generated? At least all codepaths I see that lead to finish_close_channel seem to set DoPersist or notify_on_drop?

Could it be that you're referring to the issue described in #2508, which however means DiscardFunding wouldn't get generated in the first place?

TheBlueMatt · 2024-07-15

I'm referring to what happens if the ChannelManager (or any other thing) is not persisted. #2508 does come up here, but this generally applies to all events.

tnull · 2024-07-16

Hmm, may be worth opening another issue to that end?

TheBlueMatt · 2024-07-16

Yea, I don't think it needs to be addressed here, we should just not close #2097 until we address it.

tnull · 2024-07-17

Now opened #3186 to track this specifically as #2097 didn't mention this super clearly, IMO. Will let #2097 close with the changes in this PR.

tnull · 2024-07-03T08:50:36Z

Ah, indeed, you're right. We should, however, add something similar in chainmonitor.

Agreed. Now added a fixup that has ChainMonitor trigger its event_notifier when any of the ChannelMonitor::process_pending_events calls fails.
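A rough sketch of the idea in this fixup: an aggregating type walks its monitors and raises a notifier flag when any of them fails to process its events. `Monitor` and `Aggregator` are hypothetical simplifications, the notifier is reduced to an `AtomicBool`, and continuing with the remaining monitors after a failure is an assumption of this sketch rather than something the thread states.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Error a handler returns to ask for the event to be replayed.
pub struct ReplayEvent;

/// Hypothetical per-channel monitor with its own pending events.
pub struct Monitor {
    pending: Vec<u32>,
}

impl Monitor {
    fn process_pending_events<F>(&mut self, handler: &mut F) -> Result<(), ReplayEvent>
    where
        F: FnMut(u32) -> Result<(), ReplayEvent>,
    {
        while let Some(&ev) = self.pending.first() {
            handler(ev)?;
            // Drop the event only after it was handled successfully.
            self.pending.remove(0);
        }
        Ok(())
    }
}

/// Hypothetical aggregator over several monitors.
pub struct Aggregator {
    monitors: Vec<Monitor>,
    event_notifier: AtomicBool,
}

impl Aggregator {
    pub fn new(queues: Vec<Vec<u32>>) -> Self {
        let monitors = queues.into_iter().map(|pending| Monitor { pending }).collect();
        Self { monitors, event_notifier: AtomicBool::new(false) }
    }

    pub fn process_pending_events<F>(&mut self, mut handler: F)
    where
        F: FnMut(u32) -> Result<(), ReplayEvent>,
    {
        for monitor in self.monitors.iter_mut() {
            if monitor.process_pending_events(&mut handler).is_err() {
                // Wake the background loop so the failed events are retried
                // promptly instead of waiting for the next timer tick.
                self.event_notifier.store(true, Ordering::Release);
            }
        }
    }

    /// Returns whether a wakeup was requested, clearing the flag.
    pub fn take_notification(&self) -> bool {
        self.event_notifier.swap(false, Ordering::AcqRel)
    }

    pub fn pending_counts(&self) -> Vec<usize> {
        self.monitors.iter().map(|m| m.pending.len()).collect()
    }
}
```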

tnull · 2024-07-08T13:56:11Z

Added a simple BackgroundProcessor test to check events are being replayed and also added a commit documenting the failure-to-handle/persistence behavior of all event variants (i.e., addressed #2097). I now also dropped the serialization logic for OnionMessageIntercepted and OnionMessagePeerConnected, as we don't ever write these events.

jkczyz

LGTM. Feel free to squash, IMO.

lightning/src/onion_message/messenger.rs

tnull · 2024-07-17T17:21:02Z

LGTM. Feel free to squash, IMO.

Squashed without further changes.

TheBlueMatt · 2024-07-17T18:59:05Z

The Make event handling fallible commit itself doesn't build, so check_commits is failing.

TheBlueMatt

A few nits and one real comment, but only worth fixing cause you have to rewrite some commits anyway.

lightning/src/util/async_poll.rs

lightning/src/events/mod.rs

tnull

The Make event handling fallible commit itself doesn't build, so check_commits is failing.

Squashed the MultiResultFuturePoller commit and added further fixups addressing the nits.

lightning/src/events/mod.rs

lightning/src/util/async_poll.rs

This is a minor refactor that will allow us to access the individual event queue Mutexes separately, allowing us to drop the locks earlier when processing them individually.
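For readers unfamiliar with the pattern, the following sketch shows why one `Mutex` per queue helps: each queue can be locked, drained and released without touching the other. `PendingEvents` and its field and method names are illustrative guesses, not the actual fields in `messenger.rs`.

```rust
use std::sync::Mutex;

/// Hypothetical holder with one lock per event queue instead of a single
/// lock around both.
pub struct PendingEvents {
    pub intercepted_msgs: Mutex<Vec<String>>,
    pub peer_connecteds: Mutex<Vec<u64>>,
}

impl PendingEvents {
    pub fn new() -> Self {
        Self {
            intercepted_msgs: Mutex::new(Vec::new()),
            peer_connecteds: Mutex::new(Vec::new()),
        }
    }

    pub fn push_intercepted(&self, msg: String) {
        self.intercepted_msgs.lock().unwrap().push(msg);
    }

    pub fn push_peer_connected(&self, peer_id: u64) {
        self.peer_connecteds.lock().unwrap().push(peer_id);
    }

    /// Takes all intercepted messages; the guard is a temporary, so the lock
    /// is released as soon as this statement finishes.
    pub fn drain_intercepted(&self) -> Vec<String> {
        std::mem::take(&mut *self.intercepted_msgs.lock().unwrap())
    }

    /// Takes all peer-connected events without touching the other queue.
    pub fn drain_peer_connected(&self) -> Vec<u64> {
        std::mem::take(&mut *self.peer_connecteds.lock().unwrap())
    }
}
```

With a single lock around both queues, draining one would block producers of the other for the same duration.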

jkczyz

LGTM

Previously, we would require our users to handle all events successfully inline or panic while trying to do so. If they exited the `EventHandler` any other way, we'd forget about the event and wouldn't replay it after restart. Here, we implement fallible event handling, allowing the user to return `Err(())` which signals to our event providers they should abort event processing and replay any unhandled events later (i.e., in the next invocation).

Previously, we would just fire-and-forget in `OnionMessenger`'s event handling. Since we now introduced the possibility of event handling failures, we here adapt the event handling logic to retain any events which we failed to handle to have them replayed upon the next invocation of `process_pending_events`/`process_pending_events_async`.

tnull · 2024-07-18T13:55:12Z

Squashed fixups without further changes:

> git diff-tree -U2  258853aed e617a394e
>

TheBlueMatt

Happy to land this, but probably needs some small followups.

lightning/src/onion_message/messenger.rs

lightning-background-processor/src/lib.rs

lightning/src/events/mod.rs

tnull marked this pull request as draft April 15, 2024 09:12

tnull commented Apr 15, 2024

tnull commented Apr 15, 2024

tnull mentioned this pull request May 16, 2024

VSS Integration Tracking lightningdevkit/ldk-node#246

Open

12 tasks

tnull force-pushed the 2024-04-fallible-event-handler branch from 5a45ebe to 34d7a9b Compare May 27, 2024 10:03

TheBlueMatt reviewed May 29, 2024

tnull mentioned this pull request May 30, 2024

Add BOLT12 support lightningdevkit/ldk-node#256

Merged

tnull force-pushed the 2024-04-fallible-event-handler branch from 34d7a9b to 772a851 Compare June 3, 2024 12:43

tnull force-pushed the 2024-04-fallible-event-handler branch from 772a851 to 4debd21 Compare June 5, 2024 09:09

G8XSU self-requested a review June 6, 2024 17:40

tnull force-pushed the 2024-04-fallible-event-handler branch 2 times, most recently from fd89e2a to 43a4a04 Compare July 2, 2024 13:11

tnull marked this pull request as ready for review July 2, 2024 13:22

TheBlueMatt reviewed Jul 2, 2024

tnull force-pushed the 2024-04-fallible-event-handler branch 2 times, most recently from 7dc55e5 to 188edca Compare July 3, 2024 08:48

tnull force-pushed the 2024-04-fallible-event-handler branch from 188edca to 985056c Compare July 8, 2024 13:51

tnull force-pushed the 2024-04-fallible-event-handler branch 2 times, most recently from d20af7c to 6e83263 Compare July 8, 2024 14:01

tnull force-pushed the 2024-04-fallible-event-handler branch 2 times, most recently from 291f42e to 456bcf5 Compare July 17, 2024 08:21

tnull mentioned this pull request Jul 17, 2024

Document persistence failure behavior for all events #3186

Open

jkczyz reviewed Jul 17, 2024

tnull force-pushed the 2024-04-fallible-event-handler branch from 456bcf5 to 1c2478d Compare July 17, 2024 17:20

jkczyz previously approved these changes Jul 17, 2024

TheBlueMatt reviewed Jul 17, 2024

tnull commented Jul 18, 2024

tnull dismissed jkczyz’s stale review via cbcd88f July 18, 2024 07:05

tnull force-pushed the 2024-04-fallible-event-handler branch from 1c2478d to cbcd88f Compare July 18, 2024 07:05

Hold sep. Mutexes for pending intercepted_msgs/peer_connected events

b5b57f1

This is a minor refactor that will allow us to access the individual event queue Mutexes separately, allowing us to drop the locks earlier when processing them individually.

tnull force-pushed the 2024-04-fallible-event-handler branch from cbcd88f to 258853a Compare July 18, 2024 07:06

jkczyz reviewed Jul 18, 2024

tnull added 4 commits July 18, 2024 15:54

Add simple test for event replaying

8599bc9

Document Failure Behavior and Persistence for every event type

e617a39

tnull force-pushed the 2024-04-fallible-event-handler branch from 258853a to e617a39 Compare July 18, 2024 13:54

jkczyz approved these changes Jul 18, 2024

TheBlueMatt approved these changes Jul 18, 2024

TheBlueMatt merged commit 2bfddea into lightningdevkit:main Jul 18, 2024

TheBlueMatt mentioned this pull request Jul 18, 2024

#2995 Followups #3191

Closed

tnull mentioned this pull request Jul 19, 2024

#2995 followups #3193

Merged

tnull mentioned this pull request Sep 23, 2024

Replace verbose event processing logging #3331

Closed

tnull mentioned this pull request Nov 7, 2024

Run async event processing in parallel #2491

Open

tnull mentioned this pull request Jan 30, 2025

Better document synchronicity on event handling #1194

Closed
