- 
                Notifications
    You must be signed in to change notification settings 
- Fork 421
Make event handling fallible #2995
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make event handling fallible #2995
Conversation
| 
 I believe our recommendation was always to simply loop trying to handle the event until they succeed, which is basically what we're doing here for them :) As to the code here, I think we should make more clear in the interface the event will be replayed, eg by making the error variant a unit struct called  | 
5a45ebe    to
    34d7a9b      
    Compare
  
    | Codecov ReportAttention: Patch coverage is  
 Additional details and impacted files@@            Coverage Diff             @@
##             main    #2995      +/-   ##
==========================================
- Coverage   89.79%   89.75%   -0.05%     
==========================================
  Files         121      121              
  Lines      100826   100916      +90     
  Branches   100826   100916      +90     
==========================================
+ Hits        90537    90576      +39     
- Misses       7614     7658      +44     
- Partials     2675     2682       +7     ☔ View full report in Codecov by Sentry. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See-also my above comments.
34d7a9b    to
    772a851      
    Compare
  
    | Rebased and included a fixup to introduce  
 Could you clarify which flag you are referring to exactly? Do you mean just  | 
| 
 The  
 I think that's okay if the BP loop busy-waits? If we're blocked waiting for something to complete a busy-wait is a perfectly reasonable way to signal to the user that something is horribly wrong (maybe they'll file a bug report asking why their phone is getting hot) :) | 
| 
 I'm confused: wouldn't this only work for  
 Hmm, I tend to disagree? That might be okay in blocking land where event handling would run on its own thread, but in  | 
| 
 Sure, they should all do a similar thing. 
 Hmm, as long as the user has an async...anything handling the event that's failing we should yield at least once during the BP loop. We could add an explicit yield (in the form of a ~1ms sleep), I guess? | 
772a851    to
    4debd21      
    Compare
  
    | Rebased to resolve conflicts after #3060 landed, have yet to adjust  | 
fd89e2a    to
    43a4a04      
    Compare
  
    | Now added logic to retain failed events in the  Two observations: 
 
 Coming back to this: Upon further inspection it seems we set  Generally this is still missing some test coverage, but should be ready for another round of review apart from that. | 
| 
 Ah, indeed, you're right. We should, however, add something similar in  | 
        
          
                lightning/src/events/mod.rs
              
                Outdated
          
        
      | /// An error type that may be returned to LDK in order to safely abort event handling if it can't | ||
| /// currently succeed (e.g., due to a persistence failure). | ||
| /// | ||
| /// LDK will ensure the event is persisted and will eventually be replayed. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't be so cut-and-dry here cause its not really true - we may well (persist and) replay the event, but depending on the event type we may not. I'm not really sure how to accurately communicate this to users, though, at least short of documenting it for each individual event type (#2097).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yeah, this comment predates the discussion we had offline. I think I'll see to pick up #2097 as part of this PR also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now took a stab at addressing #2097. Let me know if you think we should add more fine-grained context to some variants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! It looks great, though we also really need to add some additional context around events that are "robust" vs not - eg you could open a channel, have it closed and restart without  ever persisting, implying a "lost" DiscardFunding. Doesn't have to happen in this PR, though is an implied part of #2097.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused by the DiscardFunding example, as it seems to me it would indeed be persisted across restart if it was ever generated? At least all codepaths I see that lead to finish_close_channel seem to set DoPersist or notify_on_drop?
Could it be that you're referring to the issue described in #2508, which however means DiscardFunding wouldn't get generated in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm referring to what happens if the ChannelManager (or any other thing) is not persisted. #2508 does come up here, but this generally applies to all events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, may be worth opening another issue to that end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I don't think it needs to be addressed here, we should just not close #2097 until we address it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
7dc55e5    to
    188edca      
    Compare
  
    | 
 Agreed. Now added a fixup that has  | 
188edca    to
    985056c      
    Compare
  
    | Added a simple BackgroundProcessor test to check events are being replayed and also added a commit documenting the failure-to-handle/persistence behavior of all event variants (i.e., addressed #2097). I now also dropped the serialization logic for  | 
d20af7c    to
    6e83263      
    Compare
  
    291f42e    to
    456bcf5      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Feel free to squash, IMO.
456bcf5    to
    1c2478d      
    Compare
  
    | 
 Squashed without further changes. | 
| The  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few nits and one real comment, but only worth fixing cause you have to rewrite some commits anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
Make event handling falliblecommit itself doesn't build, so check_commits is failing.
Squashed the MultiResultFuturePoller commit and added further fixups addressing the nits.
1c2478d    to
    cbcd88f      
    Compare
  
    This is a minor refactor that will allow us to access the individual event queue Mutexes separately, allowing us to drop the locks earlier when processing them individually.
cbcd88f    to
    258853a      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Previously, we would require our users to handle all events successfully inline or panic will trying to do so. If they would exit the `EventHandler` any other way we'd forget about the event and wouldn't replay them after restart. Here, we implement fallible event handling, allowing the user to return `Err(())` which signals to our event providers they should abort event processing and replay any unhandled events later (i.e., in the next invocation).
Previously, we would just fire-and-forget in `OnionMessenger`'s event handling. Since we now introduced the possibility of event handling failures, we here adapt the event handling logic to retain any events which we failed to handle to have them replayed upon the next invocation of `process_pending_events`/`process_pending_events_async`.
258853a    to
    e617a39      
    Compare
  
    | Squashed fixups without further changes: > git diff-tree -U2  258853aed e617a394e
> | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to land this, but probably needs some small followups.
Closes #2490,
Closes #2097.
Previously, we would require our users to handle all events successfully inline or panic will trying to do so. If they would exit the
EventHandlerany other way we'd forget about the event and wouldn't replay them after restart.Here, we implement fallible event handling, allowing the user to return
Err(())which signals to our event providers they should abort event processing and replay any unhandled events later (i.e., in the next invocation).TODO:
Err(()).