Skip to content

statement-store: reject statement peers during major sync#11449

Open
peter941221 wants to merge 10 commits intoparitytech:masterfrom
peter941221:peter/statement-store-major-sync
Open

statement-store: reject statement peers during major sync#11449
peter941221 wants to merge 10 commits intoparitytech:masterfrom
peter941221:peter/statement-store-major-sync

Conversation

@peter941221
Copy link

@peter941221 peter941221 commented Mar 20, 2026

Summary

This PR fixes #11411 and the adjacent statement-handler lifecycle holes it exposed by preventing peers from remaining in a half-connected statement-protocol state while the node is unavailable.

The core repair is still the original major-sync fix:

  1. Reject inbound statement substreams while the handler is unavailable.
  2. Disconnect statement streams that still open while unavailable.
  3. Disconnect already-connected peers that send statements while unavailable, and clean up their local statement-handler state.

Follow-up reviewer passes then closed the same half-connected state in nearby paths:

  • offline transitions,
  • async send failures,
  • outbound interruption,
  • duplicate and stale reopen events,
  • initial-sync fetch failures,
  • and false completion metrics on aborted cleanup paths.

Why this is needed

Today a peer can end up in a broader half-connected state:

  • the statement stream is open,
  • live statements are rejected or interrupted while the node is unavailable,
  • peers opened during that window may never get a clean initial-sync replay path if cleanup is incomplete,
  • and there is no later replay trigger.

That means statements can be missed permanently even though the peer still looks connected.

A key detail here is that gating only newly opening peers is not sufficient. Peers that were already connected before major sync starts can still hit the same loss window once sync state changes, and later reviewer passes found equivalent stale-state variants after send failures and reopen races.

Why disconnect instead of silently ignoring

Silently ignoring statements preserves the broken state: the peer stays connected on the statement protocol, but cannot contribute live statements and does not get a clean catch-up path.

Disconnecting is the smallest fix that reuses the existing lifecycle:

  • reject statement traffic while unavailable,
  • clean up local statement-handler state eagerly,
  • let the peer reconnect once the node is ready again,
  • and reuse the normal stream-open path to queue initial sync.

Tests

This branch now adds regression coverage for:

  • rejecting inbound statement substreams during major sync,
  • rejecting inbound statement substreams while offline,
  • disconnecting streams that open during major sync,
  • disconnecting streams that open while offline,
  • reconnecting after sync exit and re-queuing initial sync,
  • preserving the normal open path outside major sync,
  • disconnecting already-open peers that send while unavailable,
  • send failure retry and cleanup,
  • interrupted outbound propagation cleanup,
  • duplicate stream-open state replacement,
  • stale unavailable reopen cleanup,
  • initial-sync fetch failure cleanup,
  • and ensuring aborted initial sync cleanup does not inflate completion metrics.

Validation

  • rustfmt --edition 2021 substrate/client/network/statement/src/lib.rs
  • git diff --check -- substrate/client/network/statement/src/lib.rs
  • cargo check -p sc-network-statement --lib
  • cargo test -p sc-network-statement --lib -- --nocapture

Validation was completed in a Linux-native WSL checkout with real symlinks preserved. Earlier Windows-backed checkout failures were environment-specific and are no longer the limiting factor for this branch.

- reject inbound statement substreams while major syncing
- disconnect statement streams that open during the sync window
- disconnect already-connected peers that send during major sync
- add regression tests for reconnect and initial-sync recovery paths
@cla-bot-2021
Copy link

cla-bot-2021 bot commented Mar 20, 2026

User @peter941221, please sign the CLA here.

@peter941221
Copy link
Author

Quick update: I pushed two more reviewer-driven fixes on this branch:

  • 4b69bf3 — disconnect and cleanly retry if initial-sync statement fetch fails
  • 19bf2c0 — only record initial_sync_duration_seconds on true initial-sync completion

Both are covered by the current Linux-native validation:

  • cargo test -p sc-network-statement --lib -- --nocapture
  • cargo check -p sc-network-statement --lib

If you'd prefer the later hardening commits split into follow-up PRs, I'm happy to do that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

statement-store: Race between is_major_sync and receiving statements

2 participants