- 
                Notifications
    You must be signed in to change notification settings 
- Fork 420
Handle async initial ChannelMonitor persistence failing on restart #1678
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle async initial ChannelMonitor persistence failing on restart #1678
Conversation
9ad3838    to
    ba2a2ba      
    Compare
  
    | Codecov ReportBase: 90.73% // Head: 91.17% // Increases project coverage by  
 
 Additional details and impacted files@@            Coverage Diff             @@
##             main    #1678      +/-   ##
==========================================
+ Coverage   90.73%   91.17%   +0.44%     
==========================================
  Files          87       87              
  Lines       46713    49843    +3130     
  Branches    46713    49843    +3130     
==========================================
+ Hits        42383    45444    +3061     
- Misses       4330     4399      +69     
 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More parse to be done to check accuracy of code and documentation. Nice the documentation effort towards users!
| /// secret we gave them that they shouldn't know). | ||
| /// | ||
| /// Broadcasting these transactions in the second case is UNSAFE, as they allow counterparty | ||
| /// side to punish you. Nevertheless you may want to broadcast them if counterparty doesn't | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good comment, you might even add "YOU WILL LOSS THE INTEGRALITY OF YOUR CHANNEL BALANCE" to make even clearer that punishment induce loss of the whole funds. It can be hard for an uninformed user to estimate the punishment scope at first sight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is #1106, see discussion there. Ultimately the issue is the docs were rewritten there assuming we're sticking with the "make sure you always store the first monitor locally and update it always" model, but we really want to drop that model (which this PR is the first step towards), so we're gonna have to rewrite those docs again soon :(.
| /// or a remote copy of this [`ChannelMonitor`] is no longer reachable and thus not updatable). | ||
| /// | ||
| /// When this is returned, [`ChannelManager`] will force-close the channel but *not* broadcast | ||
| /// our current commitment transaction. This avoids a dangerous case where a local disk failure | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the future a ChannelManager could still decide to force-close the channel by broadcasting the current commitment transaction if there is a non-related or recoverable disk failure . Let's say you have a chain::Watch implementation coordinating a set of monitor replica (as documented in our glossary). If one of this monitor's Persist implementation is affected by a disk-failure, the error will be reported to the coordinator. If the number of monitor replica affected by a permanent failure is above some security or safety threshold, the coordinator could inform the ChannelManager to force-close channels by prevention (e.g to avoid the in-flight HTLCs exposure to increase). It should be safe to do so, as the latest state available in the available monitor replica should be equivalent to the latest off-chain one.
Just to say, it's good to have that level of documentation at the interface-level, however it's only documenting one model of deployment. As we refine those interfaces to support more fancy types of deployment, I think it would be good to gather it elsewhere, more accessible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I don't think that necessarily belongs in the interface directly, though. Users who wish to use a fancier model can manually broadcast at that point. We should, of course, support this, and document it, but I'm not sure the code needs to change. In any case, I basically want to rewrite all the monitor stuff soon-ish, soooo
        
          
                lightning/src/chain/mod.rs
              
                Outdated
          
        
      | /// [`PermanentFailure`]: ChannelMonitorUpdateResult::PermanentFailure | ||
| /// [`UpdateInProgress`]: ChannelMonitorUpdateResult::UpdateInProgress | ||
| /// [`ChannelManager`]: crate::ln::channelmanager::ChannelManager | ||
| UpdateInProgress, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might note the trade-off brought by UpdateInProgress, if you allow the channel manager to keep moving on its own without state being fully persisted you gain in latency at the HTLC relay-level. However, you're increasing your risks w.r.t to state replication. This trade-off might be acceptable for some routing hops node operators if in the future scoring algorithms starts to encompass latency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, probably worth mentioning in general, not specific to just the async stuff.
1e90a4d    to
    c7590e7      
    Compare
  
    | Rebased after we landed #1106. Now the real work begins :) | 
c7590e7    to
    c88eb35      
    Compare
  
    |  | ||
| let persister: test_utils::TestPersister; | ||
| let new_chain_monitor: test_utils::TestChainMonitor; | ||
| let nodes_0_deserialized: ChannelManager<EnforcingSigner, &test_utils::TestChainMonitor, &test_utils::TestBroadcaster, &test_utils::TestKeysInterface, &test_utils::TestFeeEstimator, &test_utils::TestLogger>; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to make a TestChannelManager type alias for this, we have this same type declaration in several places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do it as a part of #1696?
| nodes[0].chain_source.watched_txn.lock().unwrap().clear(); | ||
| nodes[0].chain_source.watched_outputs.lock().unwrap().clear(); | ||
|  | ||
| let nodes_0_serialized = nodes[0].node.encode(); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we extract the ChannelManager reloading into a macro/function? This seems like something super useful to use across tests and would reduce a good bit of code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, #1696. I'm not really excited about doing it here - this PR is a part of the bigger 0.1 series which is nontrivial and needs to keep moving.
| nodes[0].node = &nodes_0_deserialized; | ||
| assert!(nodes_0_read.is_empty()); | ||
|  | ||
| check_closed_event!(nodes[0], 1, ClosureReason::DisconnectedPeer); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's check that the counterparty also forgets the channel here and in the other test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't forget the channel, though. Or, at least, it only will if it connects and gets the error-unknown channel message. Is it worth adding a specific test for that here? (I did add a check-list_channels-is-empty check).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's not required for the sake of what we're testing, but we could either do that or connect blocks until the 2016 block deadline is reached.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, feel free to squash.
e06229a    to
    37e5458      
    Compare
  
    | Squashed without further changes from  | 
        
          
                lightning/src/ln/channel.rs
              
                Outdated
          
        
      | } | ||
| if self.cur_holder_commitment_transaction_number == INITIAL_COMMITMENT_NUMBER - 1 && | ||
| self.cur_counterparty_commitment_transaction_number == INITIAL_COMMITMENT_NUMBER - 1 { | ||
| // If we're a 0-conf channel, and our commitment transaction numbers have both been | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand why 0-conf is relevant here. Is the first paragraph of the comment describing how we would get inside the enclosing if? (Would prefer that the comment precede the if if that is the case.) And the second paragraph describing the assertion in the if immediately below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I tend to always include comments describing how we got into a scope in the scope itself, otherwise the flow is hard to read due to everything being at the same indentation. I rewrote the comment a bit to split up the "why we're here" and the "what we're asserting".
| } | ||
|  | ||
| fn do_test_inbound_reload_without_init_mon(use_0conf: bool, lock_commitment: bool) { | ||
| // Test that if the monitor update generated by funding_generated is stored async and we | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
funding_created?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was referring to the channelmanager method, clarified.
| do_test_outbound_reload_without_init_mon(false); | ||
| } | ||
|  | ||
| fn do_test_inbound_reload_without_init_mon(use_0conf: bool, lock_commitment: bool) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can much of this be DRY'ed up with do_test_outbound_reload_without_init_mon if node[0] and node[1] were swapped? Seems like the diff between the two functions is mostly but not entirely index changes. Not sure if a combined function would be more or less readable with an is_outbound param.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so the first few hunks setting up the channel are the same, but then they diverge a good bit before doing a similar reload block. I guess we could DRY some of the initial stuff, but I'd really like to just DRY up all the reload stuff spewed across all our tests at once, see #1696. I'm not really sure its worth DRYing up the three or four hunks that are the same across the tests, either.
37e5458    to
    763ed22      
    Compare
  
    | Ready to merge after squash @TheBlueMatt. | 
If the initial ChannelMonitor persistence is done asynchronously but does not complete before the node restarts (with a ChannelManager persistence), we'll start back up with a channel present but no corresponding ChannelMonitor. Because the Channel is pending-monitor-update and has not yet broadcasted its initial funding transaction or sent channel_ready, this is not a violation of our API contract nor a safety violation. However, the previous code would refuse to deserialize the ChannelManager treating it as an API contract violation. The solution is to test for this case explicitly and drop the channel entirely as if the peer disconnected before we received the funding_signed for outbound channels or before sending the channel_ready for inbound channels.
As we're moving towards monitor update async being a supported use-case, we shouldn't call an async monitor update "failed", but rather "in progress". This simply updates the internal channel.rs enum name to reflect the new thinking.
763ed22    to
    958601f      
    Compare
  
    | Squashed without changes:  | 
Based on #1106, this is the second of many steps towards making the async monitor persistence feature more robust, eventually leading to 0.1 and a safe async KV store API.
If the initial ChannelMonitor persistence is done asynchronously
but does not complete before the node restarts (with a
ChannelManager persistence), we'll start back up with a channel
present but no corresponding ChannelMonitor.
Because the Channel is pending-monitor-update and has not yet
broadcasted its initial funding transaction, this is not a
violation of our API contract nor a safety violation. However, the
previous code would refuse to deserialize the ChannelManager
treating it as an API contract violation.
The solution is to test for this case explicitly and drop the
channel entirely as if the peer disconnected before we received
the funding_signed.