Skip to content

Refactor manager.rs: Extract Epoch Vote Collector and Harden Logic #547

@grantkee

Description

@grantkee

Problem

node::manager.rs should be broken up into smaller mods. The spawn_epoch_vote_collector is one example of code that would benefit from separating into a separate file.

Summary

crates/node/src/manager.rs is nearly 1900 lines and mixes epoch vote collection logic with the broader EpochManager orchestration. The epoch vote collector should be extracted into its own module, and several logic improvements should be evaluated.

Refactoring

  • Move spawn_epoch_vote_collector and epoch_vote_collector_tests into a dedicated module (e.g., crates/node/src/epoch_vote_collector.rs) to reduce LOC in manager.rs and improve logical organization.

Epoch Vote Collector Logic Improvements

The following items should be evaluated as part of or after the extraction:

1. Retrieving epoch record by number

The consensus_db.get_epoch_by_number check that skips collection when a cert already exists uses a tokio::select with epoch_rx.changed() => continue. Confirm this correctly moves to the next epoch record without skipping any watch updates.

2. Vote digest validation

Ensure that epoch vote signatures and digests are validated at the network handler layer before they reach the collector. The collector currently trusts that signatures are already verified — confirm this assumption holds and document where validation occurs.

3. Alt-record vote tracking — potential attack vector

Currently, votes for alternative epoch records are tracked in alt_recs but the voter's key is not removed from committee_keys. This means a malicious actor could send multiple votes for different alt hashes and inflate the count, potentially triggering the votes + 1 >= quorum error branch and breaking out of the collection loop early. Committee members should be removed from the tracking set after casting an alt vote to prevent double-counting.

4. Drain vote_rx in the initial tokio::select loop

While unlikely given ~24hr epoch boundaries, votes could theoretically accumulate in the mpsc buffer while the collector is waiting for an epoch record. Adding vote_rx as a branch in the first tokio::select loop would drain and buffer early votes, preventing potential overflow under adversarial conditions.

5. Handle ejected validators that signed an epoch record (#500)

Address the edge case described in #500 where ejected validators may have already signed an epoch record.

6. Replace timeout with interval for vote collection

The current approach uses a 5-second timeout that shortens to 1 second after quorum, with a hard cap of 12 timeouts (~60 seconds total). Consider replacing this with a tokio::time::interval that allows votes to arrive for the duration of the epoch. This would lower the chances of a failed epoch record while still providing a mechanism to finalize once quorum is reached and re-gossip records.

7. Evaluate fallback approach for ensuring epoch records are signed

When quorum is not reached locally, the collector falls back to requesting epoch certs from peers (up to 3 attempts). Evaluate whether this fallback is sufficient or if additional strategies are needed (e.g., periodic retries, longer collection windows, or broadcasting a request to multiple peers simultaneously).

follow up from #534

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions