-
Notifications
You must be signed in to change notification settings - Fork 762
Description
Summary
sync_wait_period() in client_actor.rs treats NoPeers (no peer height info yet) the same as AlreadyCaughtUp, scheduling the next run_sync_step() call in 10 seconds (sync_check_period) instead of 10ms (sync_step_period). This causes a 10-second delay before epoch sync can detect that a restarting node is stale and needs data reset.
Root Cause
When a node restarts, the sync flow is:
start_sync()polls every 10ms formin_num_peers. Once a peer connects (~100ms),sync_started = true.check_triggers()callsrun_sync_step(), which callssyncing_info()to check if sync is needed.syncing_info()filtershighest_height_peers. At this point, peers are connected but haven't reported heights yet (heights arrive asynchronously viaSetNetworkInfoevery ~100ms). So it returnsNoPeers.- With
NoPeers,run_sync_step()setsSyncStatus::NoSyncand returns without callinghandle_sync_needed()— which is where epoch sync stale detection lives. sync_wait_period()decides when to schedule the nextrun_sync_step():
// BEFORE (buggy)
fn sync_wait_period(&self) -> Duration {
if let Ok(sync) = self.syncing_info() {
if !sync.sync_needed() {
// NoPeers.sync_needed() returns false → TAKES THIS BRANCH
self.client.config.sync_check_period // 10 SECONDS!
} else {
self.client.config.sync_step_period // 10ms
}
} else {
self.client.config.sync_step_period
}
}SyncRequirement::sync_needed() returns matches!(self, Self::SyncNeeded { .. }), so NoPeers → false → waits 10 seconds.
- Peer heights arrive ~100ms later via
SetNetworkInfo, butrun_sync_step()is already scheduled 10 seconds from now. - After 10 seconds,
run_sync_step()finally fires, seesSyncNeeded, callshandle_sync_needed(), epoch sync detects stale → writes marker.
Timeline
T=0ms Node starts. sync_started=false.
T~100ms First SetNetworkInfo: peers connected, sync_started=true.
First run_sync_step(): no peer heights yet → NoPeers.
sync_wait_period() sees NoPeers → schedules next in 10 SECONDS.
T~200ms Next SetNetworkInfo arrives WITH peer heights.
But run_sync_step() won't fire for another ~10 seconds.
T~10.1s run_sync_step() finally fires.
syncing_info() → SyncNeeded → handle_sync_needed() → epoch sync → stale detected.
The 10-second gap between T100ms and T10.1s is the bug.
Observed impact
| Test | Marker detection time |
|---|---|
| gc_sync_after_sync (2nd restart) | 15.1s |
| state_sync4 | 10.6s |
| state_sync_missing_chunks | 10.6s |
Fast cases (1-2s) won the race condition — peer heights arrived before the first run_sync_step() call.
Fix
Treat NoPeers as "we don't know our sync state yet, check back quickly" instead of "we're caught up":
fn sync_wait_period(&self) -> Duration {
match self.syncing_info() {
Ok(sync) => match sync {
SyncRequirement::SyncNeeded { .. } => self.client.config.sync_step_period, // 10ms
SyncRequirement::NoPeers => self.client.config.sync_step_period, // 10ms (was 10s!)
SyncRequirement::AlreadyCaughtUp { .. }
| SyncRequirement::AdvHeaderSyncDisabled => self.client.config.sync_check_period, // 10s
},
Err(_) => self.client.config.sync_step_period,
}
}The semantic distinction:
AlreadyCaughtUp: We compared heights and we're caught up. Safe to wait 10s.NoPeers: We have NO peer height info. We don't know our sync state. Retry in 10ms.
The 10ms polling is lightweight — syncing_info() just filters an in-memory vector.
Header sync before epoch sync: interaction with stale detection
Why header sync can run before epoch sync detects staleness
Epoch sync and header sync both run inside handle_sync_needed(), with epoch sync checked first. However, the epoch sync check can pass (return "not stale"), causing the code to fall through to header sync. This happens because the epoch sync stale check uses header_head:
// epoch.rs:153-158
let tip_height = chain.chain_store().header_head()?.height;
if tip_height + epoch_sync_horizon_num_epochs * epoch_length >= highest_height {
return Ok(EpochSyncRunResult::Ok); // within horizon → no epoch sync
}If header_head + 4 * epoch_length >= highest_height, epoch sync passes and header sync runs.
How header sync can prevent epoch sync from ever triggering
Header sync is asynchronous — it requests headers in one sync cycle, and the headers arrive between cycles via BlockHeadersResponse → receive_headers() → sync_block_headers(), which updates header_head in the DB. On the next sync cycle, epoch sync sees the updated (higher) header_head, making it even less likely to trigger.
Each header sync batch can advance header_head by up to 512 blocks. If even a single batch succeeds, header_head jumps significantly, potentially pushing the node permanently inside the 4-epoch horizon — even as highest_height continues growing at ~1 block/second.
Example (node 2-3 epochs behind, epoch_length=50):
Cycle 1: header_head=200, highest=350. 200 + 4*50 = 400 >= 350 → epoch sync passes.
Header sync runs, requests headers.
Cycle 2: Headers arrived. header_head=350. 350 + 200 = 550 >= 360 → epoch sync still passes.
State sync triggers (head 2+ epochs behind header_head).
→ State sync WITHOUT epoch sync → broken epoch sync proof (Issue #15244)
Three scenarios
| Condition | Epoch sync? | Header sync? | Outcome |
|---|---|---|---|
header_head + 4*epoch_length < highest_height |
Triggers → data reset | No | Correct |
header_head + 4*epoch_length >= highest_height (barely) |
No initially, yes later as highest_height grows |
Runs but stalls (peer GC'd old headers) | Delayed but eventually correct |
header_head + 4*epoch_length >= highest_height (comfortably, 2-3 epochs behind) |
Never triggers | Succeeds, advances header_head further |
Bug — state sync without epoch sync (#15244) |
The third row is the root cause of #15244. The sync_wait_period fix addressed the 10-second delay but does not change the epoch sync horizon check. The fundamental trigger misalignment (epoch sync horizon = 4 epochs, state sync threshold = 2 epochs) is tracked in #15244.
Observed case: gc_sync_after_sync 2nd restart
header_head = 129, epoch_length = 30, highest_height = 245
129 + 4*30 = 249 >= 245 → epoch sync passes (barely within horizon)
Falls through to header sync → "initial transition to header sync"
Header sync stalls (peer GC'd headers below ~155)
Network produces blocks: highest_height grows past 249
~3 seconds later: epoch sync triggers → data reset
This is the "borderline" case — epoch sync eventually triggers because header sync can't advance header_head (GC'd headers). Total delay: ~3s waiting for highest_height to exceed the horizon.
Impact beyond pytests
Any production node that briefly has no peers (e.g. after network partition recovery) would wait 10 seconds before its first sync step. With the fix, it polls every 10ms and starts syncing as soon as peer heights are available.
Code references
sync_wait_period():chain/client/src/client_actor.rs:1750-1764syncing_info():chain/client/src/client_actor.rs:1642-1683run_sync_step():chain/client/src/client_actor.rs:1769-1812handle_sync_needed()(epoch sync runs here):chain/client/src/sync/handler.rs:86-107EpochSync::run()horizon check:chain/client/src/sync/epoch.rs:153-158- Header sync
run():chain/client/src/sync/header.rs:123-199 receive_headers()(advances header_head between sync cycles):chain/client/src/client_actor.rs:1614-1634sync_check_period= 10s:core/chain-configs/src/client_config.rs:560-561sync_step_period= 10ms:core/chain-configs/src/client_config.rs:568-569