fix(sync): sync_wait_period() delays epoch sync stale detection by 10 seconds

## Summary

`sync_wait_period()` in `client_actor.rs` treats `NoPeers` (no peer height info yet) the same as `AlreadyCaughtUp`, scheduling the next `run_sync_step()` call in 10 seconds (`sync_check_period`) instead of 10ms (`sync_step_period`). This causes a 10-second delay before epoch sync can detect that a restarting node is stale and needs data reset.

## Root Cause

When a node restarts, the sync flow is:

1. `start_sync()` polls every 10ms for `min_num_peers`. Once a peer connects (~100ms), `sync_started = true`.
2. `check_triggers()` calls `run_sync_step()`, which calls `syncing_info()` to check if sync is needed.
3. `syncing_info()` filters `highest_height_peers`. At this point, peers are connected but haven't reported heights yet (heights arrive asynchronously via `SetNetworkInfo` every ~100ms). So it returns `NoPeers`.
4. With `NoPeers`, `run_sync_step()` sets `SyncStatus::NoSync` and returns **without calling `handle_sync_needed()`** — which is where epoch sync stale detection lives.
5. `sync_wait_period()` decides when to schedule the next `run_sync_step()`:

```rust
// BEFORE (buggy)
fn sync_wait_period(&self) -> Duration {
    if let Ok(sync) = self.syncing_info() {
        if !sync.sync_needed() {
            // NoPeers.sync_needed() returns false → TAKES THIS BRANCH
            self.client.config.sync_check_period      // 10 SECONDS!
        } else {
            self.client.config.sync_step_period       // 10ms
        }
    } else {
        self.client.config.sync_step_period
    }
}
```

`SyncRequirement::sync_needed()` returns `matches!(self, Self::SyncNeeded { .. })`, so `NoPeers` → `false` → waits 10 seconds.

6. Peer heights arrive ~100ms later via `SetNetworkInfo`, but `run_sync_step()` is already scheduled 10 seconds from now.
7. After 10 seconds, `run_sync_step()` finally fires, sees `SyncNeeded`, calls `handle_sync_needed()`, epoch sync detects stale → writes marker.

### Timeline

```
T=0ms     Node starts. sync_started=false.
T~100ms   First SetNetworkInfo: peers connected, sync_started=true.
          First run_sync_step(): no peer heights yet → NoPeers.
          sync_wait_period() sees NoPeers → schedules next in 10 SECONDS.
T~200ms   Next SetNetworkInfo arrives WITH peer heights.
          But run_sync_step() won't fire for another ~10 seconds.
T~10.1s   run_sync_step() finally fires.
          syncing_info() → SyncNeeded → handle_sync_needed() → epoch sync → stale detected.
```

**The 10-second gap between T~100ms and T~10.1s is the bug.**

### Observed impact

| Test | Marker detection time |
|------|-----------------------|
| gc_sync_after_sync (2nd restart) | **15.1s** |
| state_sync4 | **10.6s** |
| state_sync_missing_chunks | **10.6s** |

Fast cases (1-2s) won the race condition — peer heights arrived before the first `run_sync_step()` call.

## Fix

Treat `NoPeers` as "we don't know our sync state yet, check back quickly" instead of "we're caught up":

```rust
fn sync_wait_period(&self) -> Duration {
    match self.syncing_info() {
        Ok(sync) => match sync {
            SyncRequirement::SyncNeeded { .. } => self.client.config.sync_step_period,  // 10ms
            SyncRequirement::NoPeers => self.client.config.sync_step_period,             // 10ms (was 10s!)
            SyncRequirement::AlreadyCaughtUp { .. }
            | SyncRequirement::AdvHeaderSyncDisabled => self.client.config.sync_check_period, // 10s
        },
        Err(_) => self.client.config.sync_step_period,
    }
}
```

The semantic distinction:
- **`AlreadyCaughtUp`**: We compared heights and we're caught up. Safe to wait 10s.
- **`NoPeers`**: We have NO peer height info. We don't know our sync state. Retry in 10ms.

The 10ms polling is lightweight — `syncing_info()` just filters an in-memory vector.

## Header sync before epoch sync: interaction with stale detection

### Why header sync can run before epoch sync detects staleness

Epoch sync and header sync both run inside `handle_sync_needed()`, with epoch sync checked first. However, the epoch sync check can **pass** (return "not stale"), causing the code to fall through to header sync. This happens because the epoch sync stale check uses `header_head`:

```rust
// epoch.rs:153-158
let tip_height = chain.chain_store().header_head()?.height;
if tip_height + epoch_sync_horizon_num_epochs * epoch_length >= highest_height {
    return Ok(EpochSyncRunResult::Ok);  // within horizon → no epoch sync
}
```

If `header_head + 4 * epoch_length >= highest_height`, epoch sync passes and header sync runs.

### How header sync can prevent epoch sync from ever triggering

Header sync is asynchronous — it **requests** headers in one sync cycle, and the headers arrive between cycles via `BlockHeadersResponse` → `receive_headers()` → `sync_block_headers()`, which updates `header_head` in the DB. On the next sync cycle, epoch sync sees the updated (higher) `header_head`, making it **even less likely** to trigger.

Each header sync batch can advance `header_head` by up to 512 blocks. If even a single batch succeeds, `header_head` jumps significantly, potentially pushing the node permanently inside the 4-epoch horizon — even as `highest_height` continues growing at ~1 block/second.

**Example** (node 2-3 epochs behind, `epoch_length=50`):
```
Cycle 1: header_head=200, highest=350. 200 + 4*50 = 400 >= 350 → epoch sync passes.
          Header sync runs, requests headers.
Cycle 2: Headers arrived. header_head=350. 350 + 200 = 550 >= 360 → epoch sync still passes.
          State sync triggers (head 2+ epochs behind header_head).
          → State sync WITHOUT epoch sync → broken epoch sync proof (Issue #15244)
```

### Three scenarios

| Condition | Epoch sync? | Header sync? | Outcome |
|---|---|---|---|
| `header_head + 4*epoch_length < highest_height` | Triggers → data reset | No | Correct |
| `header_head + 4*epoch_length >= highest_height` (barely) | No initially, yes later as `highest_height` grows | Runs but stalls (peer GC'd old headers) | Delayed but eventually correct |
| `header_head + 4*epoch_length >= highest_height` (comfortably, 2-3 epochs behind) | Never triggers | Succeeds, advances `header_head` further | **Bug** — state sync without epoch sync (#15244) |

The third row is the root cause of #15244. The `sync_wait_period` fix addressed the 10-second delay but does not change the epoch sync horizon check. The fundamental trigger misalignment (epoch sync horizon = 4 epochs, state sync threshold = 2 epochs) is tracked in #15244.

### Observed case: gc_sync_after_sync 2nd restart

```
header_head = 129, epoch_length = 30, highest_height = 245
129 + 4*30 = 249 >= 245 → epoch sync passes (barely within horizon)
Falls through to header sync → "initial transition to header sync"
Header sync stalls (peer GC'd headers below ~155)
Network produces blocks: highest_height grows past 249
~3 seconds later: epoch sync triggers → data reset
```

This is the "borderline" case — epoch sync eventually triggers because header sync can't advance `header_head` (GC'd headers). Total delay: ~3s waiting for `highest_height` to exceed the horizon.

## Impact beyond pytests

Any production node that briefly has no peers (e.g. after network partition recovery) would wait 10 seconds before its first sync step. With the fix, it polls every 10ms and starts syncing as soon as peer heights are available.

## Code references

- `sync_wait_period()`: `chain/client/src/client_actor.rs:1750-1764`
- `syncing_info()`: `chain/client/src/client_actor.rs:1642-1683`
- `run_sync_step()`: `chain/client/src/client_actor.rs:1769-1812`
- `handle_sync_needed()` (epoch sync runs here): `chain/client/src/sync/handler.rs:86-107`
- `EpochSync::run()` horizon check: `chain/client/src/sync/epoch.rs:153-158`
- Header sync `run()`: `chain/client/src/sync/header.rs:123-199`
- `receive_headers()` (advances header_head between sync cycles): `chain/client/src/client_actor.rs:1614-1634`
- `sync_check_period` = 10s: `core/chain-configs/src/client_config.rs:560-561`
- `sync_step_period` = 10ms: `core/chain-configs/src/client_config.rs:568-569`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sync): sync_wait_period() delays epoch sync stale detection by 10 seconds #15234

Summary

Root Cause

Timeline

Observed impact

Fix

Header sync before epoch sync: interaction with stale detection

Why header sync can run before epoch sync detects staleness

How header sync can prevent epoch sync from ever triggering

Three scenarios

Observed case: gc_sync_after_sync 2nd restart

Impact beyond pytests

Code references

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Test	Marker detection time
gc_sync_after_sync (2nd restart)	15.1s
state_sync4	10.6s
state_sync_missing_chunks	10.6s

Condition	Epoch sync?	Header sync?	Outcome
`header_head + 4*epoch_length < highest_height`	Triggers → data reset	No	Correct
`header_head + 4*epoch_length >= highest_height` (barely)	No initially, yes later as `highest_height` grows	Runs but stalls (peer GC'd old headers)	Delayed but eventually correct
`header_head + 4*epoch_length >= highest_height` (comfortably, 2-3 epochs behind)	Never triggers	Succeeds, advances `header_head` further	Bug — state sync without epoch sync (#15244)

fix(sync): sync_wait_period() delays epoch sync stale detection by 10 seconds #15234

Description

Summary

Root Cause

Timeline

Observed impact

Fix

Header sync before epoch sync: interaction with stale detection

Why header sync can run before epoch sync detects staleness

How header sync can prevent epoch sync from ever triggering

Three scenarios

Observed case: gc_sync_after_sync 2nd restart

Impact beyond pytests

Code references

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions