
fix(consensus): scheduling rebroadcast timeout as part of recovery#3295

Open
vbar wants to merge 5 commits into main from vbar/consensus-rebroadcast-recovery

Conversation


@vbar vbar commented Mar 24, 2026

A potential fix for #3286: when there are recent (even finalized) heights in the WAL on startup, schedule a rebroadcast of votes for them, as other nodes might not have received those votes due to the previous shutdown.

@vbar vbar requested a review from a team as a code owner March 24, 2026 13:12

t00ts commented Mar 24, 2026

I get the idea, but I'm not sure this is the expected behavior when restoring from the WAL.

Imho, this should not be opaque to the user. We should trigger this from the outside.

Happy to discuss further.


vbar commented Mar 24, 2026

> Imho, this should not be opaque to the user. We should trigger this from the outside.

Well, why? Scheduling the rebroadcast timeout is always opaque to the user (because it's always done by Malachite). Also, what is the use case for not triggering this?


t00ts commented Mar 24, 2026

> Well, why?

My thought process:

  1. The "potential fix" already asked for caution.
  2. Then the code: adding a network side effect inside a state restoration function to fix a liveness issue smelled like bad design.

This is why I initially brought it up.


Now, going into details:

recover_from_wal is called in two situations:

  1. Finalized heights within history_depth
  2. Incomplete (in-progress) heights

This PR adds a rebroadcast timeout unconditionally in both cases.
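For context, the pattern under discussion can be sketched as follows. This is a hypothetical, heavily simplified stand-in (rounds stand in for WAL entries, and the types are not the actual consensus-crate API), just to make the "side effect inside restoration" shape concrete:

```rust
// Sketch of the behaviour this PR adds: recover_from_wal schedules the
// rebroadcast timeout itself, regardless of whether the recovered
// heights were finalized or still in progress.
struct Timeout {
    round: u64,
}

#[derive(Default)]
struct Consensus {
    max_round: Option<u64>,
    timeouts: Vec<Timeout>,
}

impl Consensus {
    fn recover_from_wal(&mut self, entries: &[u64]) {
        // 1. Pure state restoration from the WAL entries.
        self.max_round = entries.iter().copied().max();
        // 2. Network side effect baked into the restoration path:
        //    a rebroadcast timeout is scheduled unconditionally.
        if let Some(round) = self.max_round {
            self.timeouts.push(Timeout { round });
        }
    }
}

fn main() {
    let mut consensus = Consensus::default();
    consensus.recover_from_wal(&[1, 4, 2]);
    // The timeout is scheduled even if round 4 was already finalized.
    assert_eq!(consensus.timeouts.len(), 1);
    println!("recovered max_round = {:?}", consensus.max_round);
}
```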

> what is the use case for not triggering this?

The case of finalized heights. The block is decided, and other nodes have moved on. Broadcasting these votes seems pointless here.

> Scheduling the rebroadcast timeout is always opaque to the user (because it's always done by Malachite).

Yes, but we're mixing two different things here.

  1. In live consensus, malachite scheduling these rebroadcasts is actual protocol behaviour. It's part of the liveness mechanism.
  2. In WAL recovery, scheduling a rebroadcast is more of a "startup policy" decision. It really is "what should the node do AFTER restoring its state".

This is why I'm advocating to trigger this from outside the restoration function. Something like the following could work:

internal_consensus.recover_from_wal(entries);
internal_consensus.schedule_rebroadcast_if_needed();


vbar commented Mar 24, 2026

> Well, why?
> what is the use case for not triggering this?

> The case of finalized heights. The block is decided, and other nodes have moved on. Broadcasting these votes seems pointless here.

No. #3286 describes exactly the case where the block is locally decided, but the other nodes did not move on (yet).


t00ts commented Mar 24, 2026

Got it. In that case, what if we have recover_from_wal return the max round it found, thus keeping it as pure state restoration (no opaque network side effects), and let the caller schedule the rebroadcast explicitly?

In the consensus crate:

pub fn recover_from_wal(...) -> Option<Round> {
    // ...
    max_round.map(Round::from)
}

pub fn schedule_rebroadcast(&mut self, round: Round) {
    self.timeout_manager.schedule_timeout(Timeout {
        kind: TimeoutKind::Rebroadcast,
        round,
    });
}

And then in both call sites:

let max_round = internal_consensus.recover_from_wal(entries);
if let Some(round) = max_round {
    // Schedule rebroadcast timeout.
    // See https://github.com/eqlabs/pathfinder/issues/3286 for motivation.
    internal_consensus.schedule_rebroadcast(round);
}

I think this keeps concerns cleaner. Wdyt?
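The proposed split can be sketched end to end as a minimal, self-contained example. The types below are hypothetical simplifications (rounds stand in for WAL entries; `Consensus` stands in for the real internal consensus state), not the actual Malachite/pathfinder API:

```rust
// Sketch of the proposal: restoration stays pure and returns the highest
// round it found, and the caller explicitly opts in to the rebroadcast.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Round(u64);

#[derive(Debug, PartialEq, Eq)]
enum TimeoutKind {
    Rebroadcast,
}

#[derive(Debug, PartialEq, Eq)]
struct Timeout {
    kind: TimeoutKind,
    round: Round,
}

#[derive(Default)]
struct Consensus {
    scheduled: Vec<Timeout>,
}

impl Consensus {
    /// Pure state restoration: replays WAL entries and reports the
    /// highest round seen, with no network side effects.
    fn recover_from_wal(&mut self, entries: &[u64]) -> Option<Round> {
        entries.iter().copied().max().map(Round)
    }

    /// Explicit startup policy, invoked by the caller after restoration.
    fn schedule_rebroadcast(&mut self, round: Round) {
        self.scheduled.push(Timeout {
            kind: TimeoutKind::Rebroadcast,
            round,
        });
    }
}

fn main() {
    let mut consensus = Consensus::default();
    // Restore first, then decide to rebroadcast, as in the call sites above.
    if let Some(round) = consensus.recover_from_wal(&[0, 2, 1]) {
        consensus.schedule_rebroadcast(round);
    }
    assert_eq!(consensus.scheduled.len(), 1);
    println!("scheduled rebroadcast for {:?}", consensus.scheduled[0].round);
}
```

The design point is that a node restoring only finalized heights could just as easily skip the `schedule_rebroadcast` call at one call site, which the in-restoration version cannot express.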


vbar commented Mar 25, 2026

> Got it. In that case, what if we have recover_from_wal return the max round it found, thus keeping it as pure state restoration (no opaque network side effects), and let the caller schedule the rebroadcast explicitly?

> I think this keeps concerns cleaner. Wdyt?

well, why not...
