
fix(consensus): scheduling rebroadcast timeout as part of recovery#3295

Open
vbar wants to merge 5 commits into main from vbar/consensus-rebroadcast-recovery

Conversation


@vbar vbar commented Mar 24, 2026

A potential fix for #3286: when there are recent (even finalized) heights in the WAL on startup, schedule a rebroadcast of votes for them, as other nodes might not have received those votes due to the previous shutdown.

@vbar vbar requested a review from a team as a code owner March 24, 2026 13:12

t00ts commented Mar 24, 2026

I get the idea, but I'm not sure this is the expected behavior when restoring from the WAL.

Imho, this should not be opaque to the user. We should trigger this from the outside.

Happy to discuss further.


vbar commented Mar 24, 2026

> Imho, this should not be opaque to the user. We should trigger this from the outside.

Well, why? Scheduling the rebroadcast timeout is always opaque to the user (because it's always done by Malachite). Also, what is the use case for not triggering this?


t00ts commented Mar 24, 2026

> Well, why?

My thought process:

  1. The "potential fix" already asked for caution.
  2. Then the code: adding a network side effect inside a state restoration function to fix a liveness issue smelled like bad design.

This is why I initially brought it up.


Now, going into details:

recover_from_wal is called in two situations:

  1. Finalized heights within history_depth
  2. Incomplete (in-progress) heights

This PR adds a rebroadcast timeout unconditionally in both cases.
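For context, the pattern under discussion can be sketched as follows. This is a hypothetical, heavily simplified stand-in (rounds stand in for WAL entries, and the types are not the actual consensus-crate API), just to make the "side effect inside restoration" shape concrete:

```rust
// Sketch of the behaviour this PR adds: recover_from_wal schedules the
// rebroadcast timeout itself, regardless of whether the recovered
// heights were finalized or still in progress.
struct Timeout {
    round: u64,
}

#[derive(Default)]
struct Consensus {
    max_round: Option<u64>,
    timeouts: Vec<Timeout>,
}

impl Consensus {
    fn recover_from_wal(&mut self, entries: &[u64]) {
        // 1. Pure state restoration from the WAL entries.
        self.max_round = entries.iter().copied().max();
        // 2. Network side effect baked into the restoration path:
        //    a rebroadcast timeout is scheduled unconditionally.
        if let Some(round) = self.max_round {
            self.timeouts.push(Timeout { round });
        }
    }
}

fn main() {
    let mut consensus = Consensus::default();
    consensus.recover_from_wal(&[1, 4, 2]);
    // The timeout is scheduled even if round 4 was already finalized.
    assert_eq!(consensus.timeouts.len(), 1);
    println!("recovered max_round = {:?}", consensus.max_round);
}
```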

> what is the use case for not triggering this?

The case of finalized heights. The block is decided, and other nodes have moved on. Broadcasting these votes seems pointless here.

> Scheduling the rebroadcast timeout is always opaque to the user (because it's always done by Malachite).

Yes, but we're mixing two different things here.

  1. In live consensus, malachite scheduling these rebroadcasts is actual protocol behaviour. It's part of the liveness mechanism.
  2. In WAL recovery, scheduling a rebroadcast is more of a "startup policy" decision. It really is "what should the node do AFTER restoring its state".

This is why I'm advocating to trigger this from outside the restoration function. Something like the following could work:

internal_consensus.recover_from_wal(entries);
internal_consensus.schedule_rebroadcast_if_needed();


vbar commented Mar 24, 2026

> Well, why?
> what is the use case for not triggering this?

> The case of finalized heights. The block is decided, and other nodes have moved on. Broadcasting these votes seems pointless here.

No. #3286 describes exactly the case where the block is locally decided, but the other nodes did not move on (yet).


t00ts commented Mar 24, 2026

Got it. In that case, what if we have recover_from_wal return the max round it found, thus keeping it as pure state restoration (no opaque network side effects), and let the caller schedule the rebroadcast explicitly?

In the consensus crate:

pub fn recover_from_wal(...) -> Option<Round> {
    // ...
    max_round.map(Round::from)
}

pub fn schedule_rebroadcast(&mut self, round: Round) {
    self.timeout_manager.schedule_timeout(Timeout {
        kind: TimeoutKind::Rebroadcast,
        round,
    });
}

And then in both call sites:

let max_round = internal_consensus.recover_from_wal(entries);
if let Some(round) = max_round {
    // Schedule rebroadcast timeout.
    // See https://github.com/eqlabs/pathfinder/issues/3286 for motivation.
    internal_consensus.schedule_rebroadcast(round);
}

I think this keeps concerns cleaner. Wdyt?
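The proposed split can be sketched end to end as a minimal, self-contained example. The types below are hypothetical simplifications (rounds stand in for WAL entries; `Consensus` stands in for the real internal consensus state), not the actual Malachite/pathfinder API:

```rust
// Sketch of the proposal: restoration stays pure and returns the highest
// round it found, and the caller explicitly opts in to the rebroadcast.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Round(u64);

#[derive(Debug, PartialEq, Eq)]
enum TimeoutKind {
    Rebroadcast,
}

#[derive(Debug, PartialEq, Eq)]
struct Timeout {
    kind: TimeoutKind,
    round: Round,
}

#[derive(Default)]
struct Consensus {
    scheduled: Vec<Timeout>,
}

impl Consensus {
    /// Pure state restoration: replays WAL entries and reports the
    /// highest round seen, with no network side effects.
    fn recover_from_wal(&mut self, entries: &[u64]) -> Option<Round> {
        entries.iter().copied().max().map(Round)
    }

    /// Explicit startup policy, invoked by the caller after restoration.
    fn schedule_rebroadcast(&mut self, round: Round) {
        self.scheduled.push(Timeout {
            kind: TimeoutKind::Rebroadcast,
            round,
        });
    }
}

fn main() {
    let mut consensus = Consensus::default();
    // Restore first, then decide to rebroadcast, as in the call sites above.
    if let Some(round) = consensus.recover_from_wal(&[0, 2, 1]) {
        consensus.schedule_rebroadcast(round);
    }
    assert_eq!(consensus.scheduled.len(), 1);
    println!("scheduled rebroadcast for {:?}", consensus.scheduled[0].round);
}
```

The design point is that a node restoring only finalized heights could just as easily skip the `schedule_rebroadcast` call at one call site, which the in-restoration version cannot express.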


vbar commented Mar 25, 2026

> Got it. In that case, what if we have recover_from_wal return the max round it found, thus keeping it as pure state restoration (no opaque network side effects), and let the caller schedule the rebroadcast explicitly?

> I think this keeps concerns cleaner. Wdyt?

well, why not...
