-
Notifications
You must be signed in to change notification settings - Fork 16
Description
Problem
node::manager.rs should be broken up into smaller mods. The spawn_epoch_vote_collector is one example of code that would benefit from separating into a separate file.
Summary
crates/node/src/manager.rs is nearly 1900 lines and mixes epoch vote collection logic with the broader EpochManager orchestration. The epoch vote collector should be extracted into its own module, and several logic improvements should be evaluated.
Refactoring
- Move
spawn_epoch_vote_collectorandepoch_vote_collector_testsinto a dedicated module (e.g.,crates/node/src/epoch_vote_collector.rs) to reduce LOC inmanager.rsand improve logical organization.
Epoch Vote Collector Logic Improvements
The following items should be evaluated as part of or after the extraction:
1. Retrieving epoch record by number
The consensus_db.get_epoch_by_number check that skips collection when a cert already exists uses a tokio::select with epoch_rx.changed() => continue. Confirm this correctly moves to the next epoch record without skipping any watch updates.
2. Vote digest validation
Ensure that epoch vote signatures and digests are validated at the network handler layer before they reach the collector. The collector currently trusts that signatures are already verified — confirm this assumption holds and document where validation occurs.
3. Alt-record vote tracking — potential attack vector
Currently, votes for alternative epoch records are tracked in alt_recs but the voter's key is not removed from committee_keys. This means a malicious actor could send multiple votes for different alt hashes and inflate the count, potentially triggering the votes + 1 >= quorum error branch and breaking out of the collection loop early. Committee members should be removed from the tracking set after casting an alt vote to prevent double-counting.
4. Drain vote_rx in the initial tokio::select loop
While unlikely given ~24hr epoch boundaries, votes could theoretically accumulate in the mpsc buffer while the collector is waiting for an epoch record. Adding vote_rx as a branch in the first tokio::select loop would drain and buffer early votes, preventing potential overflow under adversarial conditions.
5. Handle ejected validators that signed an epoch record (#500)
Address the edge case described in #500 where ejected validators may have already signed an epoch record.
6. Replace timeout with interval for vote collection
The current approach uses a 5-second timeout that shortens to 1 second after quorum, with a hard cap of 12 timeouts (~60 seconds total). Consider replacing this with a tokio::time::interval that allows votes to arrive for the duration of the epoch. This would lower the chances of a failed epoch record while still providing a mechanism to finalize once quorum is reached and re-gossip records.
7. Evaluate fallback approach for ensuring epoch records are signed
When quorum is not reached locally, the collector falls back to requesting epoch certs from peers (up to 3 attempts). Evaluate whether this fallback is sufficient or if additional strategies are needed (e.g., periodic retries, longer collection windows, or broadcasting a request to multiple peers simultaneously).
follow up from #534