Skip to content

fix(sync): sync_wait_period() delays epoch sync stale detection by 10 seconds #15234

@shreyan-gupta

Description

@shreyan-gupta

Summary

sync_wait_period() in client_actor.rs treats NoPeers (no peer height info yet) the same as AlreadyCaughtUp, scheduling the next run_sync_step() call in 10 seconds (sync_check_period) instead of 10ms (sync_step_period). This causes a 10-second delay before epoch sync can detect that a restarting node is stale and needs data reset.

Root Cause

When a node restarts, the sync flow is:

  1. start_sync() polls every 10ms for min_num_peers. Once a peer connects (~100ms), sync_started = true.
  2. check_triggers() calls run_sync_step(), which calls syncing_info() to check if sync is needed.
  3. syncing_info() filters highest_height_peers. At this point, peers are connected but haven't reported heights yet (heights arrive asynchronously via SetNetworkInfo every ~100ms). So it returns NoPeers.
  4. With NoPeers, run_sync_step() sets SyncStatus::NoSync and returns without calling handle_sync_needed() — which is where epoch sync stale detection lives.
  5. sync_wait_period() decides when to schedule the next run_sync_step():
// BEFORE (buggy)
fn sync_wait_period(&self) -> Duration {
    if let Ok(sync) = self.syncing_info() {
        if !sync.sync_needed() {
            // NoPeers.sync_needed() returns false → TAKES THIS BRANCH
            self.client.config.sync_check_period      // 10 SECONDS!
        } else {
            self.client.config.sync_step_period       // 10ms
        }
    } else {
        self.client.config.sync_step_period
    }
}

SyncRequirement::sync_needed() returns matches!(self, Self::SyncNeeded { .. }), so NoPeersfalse → waits 10 seconds.

  1. Peer heights arrive ~100ms later via SetNetworkInfo, but run_sync_step() is already scheduled 10 seconds from now.
  2. After 10 seconds, run_sync_step() finally fires, sees SyncNeeded, calls handle_sync_needed(), epoch sync detects stale → writes marker.

Timeline

T=0ms     Node starts. sync_started=false.
T~100ms   First SetNetworkInfo: peers connected, sync_started=true.
          First run_sync_step(): no peer heights yet → NoPeers.
          sync_wait_period() sees NoPeers → schedules next in 10 SECONDS.
T~200ms   Next SetNetworkInfo arrives WITH peer heights.
          But run_sync_step() won't fire for another ~10 seconds.
T~10.1s   run_sync_step() finally fires.
          syncing_info() → SyncNeeded → handle_sync_needed() → epoch sync → stale detected.

The 10-second gap between T100ms and T10.1s is the bug.

Observed impact

Test Marker detection time
gc_sync_after_sync (2nd restart) 15.1s
state_sync4 10.6s
state_sync_missing_chunks 10.6s

Fast cases (1-2s) won the race condition — peer heights arrived before the first run_sync_step() call.

Fix

Treat NoPeers as "we don't know our sync state yet, check back quickly" instead of "we're caught up":

fn sync_wait_period(&self) -> Duration {
    match self.syncing_info() {
        Ok(sync) => match sync {
            SyncRequirement::SyncNeeded { .. } => self.client.config.sync_step_period,  // 10ms
            SyncRequirement::NoPeers => self.client.config.sync_step_period,             // 10ms (was 10s!)
            SyncRequirement::AlreadyCaughtUp { .. }
            | SyncRequirement::AdvHeaderSyncDisabled => self.client.config.sync_check_period, // 10s
        },
        Err(_) => self.client.config.sync_step_period,
    }
}

The semantic distinction:

  • AlreadyCaughtUp: We compared heights and we're caught up. Safe to wait 10s.
  • NoPeers: We have NO peer height info. We don't know our sync state. Retry in 10ms.

The 10ms polling is lightweight — syncing_info() just filters an in-memory vector.

Header sync before epoch sync: interaction with stale detection

Why header sync can run before epoch sync detects staleness

Epoch sync and header sync both run inside handle_sync_needed(), with epoch sync checked first. However, the epoch sync check can pass (return "not stale"), causing the code to fall through to header sync. This happens because the epoch sync stale check uses header_head:

// epoch.rs:153-158
let tip_height = chain.chain_store().header_head()?.height;
if tip_height + epoch_sync_horizon_num_epochs * epoch_length >= highest_height {
    return Ok(EpochSyncRunResult::Ok);  // within horizon → no epoch sync
}

If header_head + 4 * epoch_length >= highest_height, epoch sync passes and header sync runs.

How header sync can prevent epoch sync from ever triggering

Header sync is asynchronous — it requests headers in one sync cycle, and the headers arrive between cycles via BlockHeadersResponsereceive_headers()sync_block_headers(), which updates header_head in the DB. On the next sync cycle, epoch sync sees the updated (higher) header_head, making it even less likely to trigger.

Each header sync batch can advance header_head by up to 512 blocks. If even a single batch succeeds, header_head jumps significantly, potentially pushing the node permanently inside the 4-epoch horizon — even as highest_height continues growing at ~1 block/second.

Example (node 2-3 epochs behind, epoch_length=50):

Cycle 1: header_head=200, highest=350. 200 + 4*50 = 400 >= 350 → epoch sync passes.
          Header sync runs, requests headers.
Cycle 2: Headers arrived. header_head=350. 350 + 200 = 550 >= 360 → epoch sync still passes.
          State sync triggers (head 2+ epochs behind header_head).
          → State sync WITHOUT epoch sync → broken epoch sync proof (Issue #15244)

Three scenarios

Condition Epoch sync? Header sync? Outcome
header_head + 4*epoch_length < highest_height Triggers → data reset No Correct
header_head + 4*epoch_length >= highest_height (barely) No initially, yes later as highest_height grows Runs but stalls (peer GC'd old headers) Delayed but eventually correct
header_head + 4*epoch_length >= highest_height (comfortably, 2-3 epochs behind) Never triggers Succeeds, advances header_head further Bug — state sync without epoch sync (#15244)

The third row is the root cause of #15244. The sync_wait_period fix addressed the 10-second delay but does not change the epoch sync horizon check. The fundamental trigger misalignment (epoch sync horizon = 4 epochs, state sync threshold = 2 epochs) is tracked in #15244.

Observed case: gc_sync_after_sync 2nd restart

header_head = 129, epoch_length = 30, highest_height = 245
129 + 4*30 = 249 >= 245 → epoch sync passes (barely within horizon)
Falls through to header sync → "initial transition to header sync"
Header sync stalls (peer GC'd headers below ~155)
Network produces blocks: highest_height grows past 249
~3 seconds later: epoch sync triggers → data reset

This is the "borderline" case — epoch sync eventually triggers because header sync can't advance header_head (GC'd headers). Total delay: ~3s waiting for highest_height to exceed the horizon.

Impact beyond pytests

Any production node that briefly has no peers (e.g. after network partition recovery) would wait 10 seconds before its first sync step. With the fix, it polls every 10ms and starts syncing as soon as peer heights are available.

Code references

  • sync_wait_period(): chain/client/src/client_actor.rs:1750-1764
  • syncing_info(): chain/client/src/client_actor.rs:1642-1683
  • run_sync_step(): chain/client/src/client_actor.rs:1769-1812
  • handle_sync_needed() (epoch sync runs here): chain/client/src/sync/handler.rs:86-107
  • EpochSync::run() horizon check: chain/client/src/sync/epoch.rs:153-158
  • Header sync run(): chain/client/src/sync/header.rs:123-199
  • receive_headers() (advances header_head between sync cycles): chain/client/src/client_actor.rs:1614-1634
  • sync_check_period = 10s: core/chain-configs/src/client_config.rs:560-561
  • sync_step_period = 10ms: core/chain-configs/src/client_config.rs:568-569

Metadata

Metadata

Assignees

No one assigned

    Labels

    C-bugCategory: This is a bug

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions