
fix(consensus): scheduling rebroadcast timeout as part of recovery #3295

Open
vbar wants to merge 2 commits into main from vbar/consensus-rebroadcast-recovery

Conversation

@vbar vbar (Contributor) commented Mar 24, 2026

A potential fix for #3286: when recent (even finalized) heights are present in the WAL on startup, schedule a rebroadcast of votes for them, as other nodes might not have received those votes due to the previous shutdown.

@vbar vbar requested a review from a team as a code owner March 24, 2026 13:12
@t00ts t00ts (Contributor) commented Mar 24, 2026

I get the idea, but I'm not sure this is the expected behavior when restoring from the WAL.

Imho, this should not be opaque to the user. We should trigger this from the outside.

Happy to discuss further.

@vbar vbar (Contributor, Author) commented Mar 24, 2026

> Imho, this should not be opaque to the user. We should trigger this from the outside.

Well, why? Scheduling the rebroadcast timeout is always opaque to the user (because it's always done by Malachite). Also, what is the use case for not triggering this?

@t00ts t00ts (Contributor) commented Mar 24, 2026

> Well, why?

My thought process:

  1. The "potential fix" already asked for caution.
  2. Then the code: adding a network side effect inside a state-restoration function to fix a liveness issue smelled like bad design.

This is why I initially brought it up.


Now, going into details:

`recover_from_wal` is called in two situations:

  1. Finalized heights within history_depth
  2. Incomplete (in-progress) heights

This PR adds a rebroadcast timeout unconditionally in both cases.

> what is the use case for not triggering this?

The case of finalized heights. The block is decided, and other nodes have moved on. Broadcasting these votes seems pointless here.

> Scheduling the rebroadcast timeout is always opaque to the user (because it's always done by Malachite).

Yes, but we're mixing two different things here.

  1. In live consensus, Malachite scheduling these rebroadcasts is actual protocol behaviour. It's part of the liveness mechanism.
  2. In WAL recovery, scheduling a rebroadcast is more of a "startup policy" decision. It really is "what should the node do AFTER restoring its state".

This is why I'm advocating to trigger this from outside the restoration function. Something like the following could work:

```rust
internal_consensus.recover_from_wal(entries);
internal_consensus.schedule_rebroadcast_if_needed();
```

@vbar vbar (Contributor, Author) commented Mar 24, 2026

> Well, why?

> what is the use case for not triggering this?

> The case of finalized heights. The block is decided, and other nodes have moved on. Broadcasting these votes seems pointless here.

No. #3286 describes exactly the case where the block is locally decided, but the other nodes have not (yet) moved on.

@t00ts t00ts (Contributor) commented Mar 24, 2026

Got it. In that case, what if we have `recover_from_wal` return the max round it found, thus keeping it as pure state restoration (no opaque network side effects), and let the caller schedule the rebroadcast explicitly?

In the consensus crate:

```rust
pub fn recover_from_wal(...) -> Option<Round> {
    // ...
    max_round.map(Round::from)
}

pub fn schedule_rebroadcast(&mut self, round: Round) {
    self.timeout_manager.schedule_timeout(Timeout {
        kind: TimeoutKind::Rebroadcast,
        round,
    });
}
```

And then in both call sites:

```rust
let max_round = internal_consensus.recover_from_wal(entries);
if let Some(round) = max_round {
    // Schedule rebroadcast timeout.
    // See https://github.com/eqlabs/pathfinder/issues/3286 for motivation.
    internal_consensus.schedule_rebroadcast(round);
}
```

I think this keeps concerns cleaner. Wdyt?

@vbar vbar (Contributor, Author) commented Mar 25, 2026

> Got it. In that case, what if we have `recover_from_wal` return the max round it found, thus keeping it as pure state restoration (no opaque network side effects), and let the caller schedule the rebroadcast explicitly?

> I think this keeps concerns cleaner. Wdyt?

well, why not...

t00ts (Contributor) previously approved these changes Mar 27, 2026

LGTM % that doc comment

@vbar vbar force-pushed the vbar/consensus-rebroadcast-recovery branch 2 times, most recently from ce9ba9a to b17bca1 Compare March 30, 2026 06:19
@vbar vbar force-pushed the vbar/consensus-rebroadcast-recovery branch from e9b5d3a to 9904a5f Compare March 31, 2026 12:20
@CHr15F0x CHr15F0x (Contributor) left a comment

LGTM mod the rename.

```rust
feature = "consensus-integration-tests",
debug_assertions
))]
pub fn do_not_send_vote(vote_height: u64, inject_failure: Option<InjectFailureConfig>) -> bool {
```

Suggested change:

```diff
-pub fn do_not_send_vote(vote_height: u64, inject_failure: Option<InjectFailureConfig>) -> bool {
+pub fn debug_do_not_send_vote(vote_height: u64, inject_failure: Option<InjectFailureConfig>) -> bool {
```

```rust
feature = "consensus-integration-tests",
debug_assertions
)))]
pub fn do_not_send_vote(
```

Suggested change:

```diff
-pub fn do_not_send_vote(
+pub fn debug_do_not_send_vote(
```
