perf(l1): overlap header downloads with state sync #6194
base: main
Conversation
Fetch the pivot header by hash directly and start state sync immediately. Header download runs as a background tokio task concurrently with state sync, removing 15-20 minutes of header download from the critical path on mainnet.
Key changes:
- sync_cycle_snap: fetch pivot early via request_pivot_header, spawn download_headers_background, call snap_sync immediately
- snap_sync/update_pivot: block_sync_state is now Arc<Mutex<>> shared between background download and state sync
- snap_sync awaits header_download_handle before finalization
- Added PeerHandler::get_block_header_by_hash for single-header requests
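A minimal sketch of the new control flow based on the description above; the argument lists and types (PeerHandler, Store, SnapBlockSyncState, SyncError) are simplified assumptions, not the PR's exact signatures:

```rust
use std::{path::PathBuf, sync::Arc};
use tokio::sync::Mutex;

// Sketch only: simplified signatures, types come from the ethrex p2p crates.
async fn sync_cycle_snap(
    mut peers: PeerHandler,
    store: Store,
    sync_head: H256,
    block_sync_state: SnapBlockSyncState,
    datadir: PathBuf,
) -> Result<(), SyncError> {
    // Fetch the pivot header directly by hash instead of walking the whole chain first.
    let pivot_header = request_pivot_header(&mut peers, sync_head).await?;
    store.add_block_headers(vec![pivot_header.clone()]).await?;

    // Share the sync state between the background downloader and state sync.
    let block_sync_state = Arc::new(Mutex::new(block_sync_state));

    // Headers keep downloading in the background while state sync runs.
    let header_handle = {
        let (mut bg_peers, bg_store, bss) =
            (peers.clone(), store.clone(), block_sync_state.clone());
        tokio::spawn(async move {
            download_headers_background(&mut bg_peers, &bg_store, &bss, sync_head).await
        })
    };

    // State sync starts immediately; it awaits header_handle before
    // store_block_bodies / forkchoice_update, which need the full header chain.
    snap_sync(peers, &store, &block_sync_state, datadir, header_handle).await
}
```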
🤖 Kimi Code Review
Review Summary
This PR introduces significant changes to snap sync by decoupling header downloads from state sync. While the approach is sound, there are several critical issues that need addressing:
Critical Issues
Security & Correctness Issues
Performance & Best Practices
Suggested Fixes
The overall architecture change is good, but these issues need addressing before merge.
Automated review by Kimi (Moonshot AI) · custom prompt
🤖 Codex Code Review
Findings (ordered by severity)
Other notes
Tests / validation
If you want, I can propose a small refactor to avoid holding the mutex across awaits and add a defensive hash check.
Automated review by OpenAI Codex · custom prompt
🤖 Claude Code Review
Now let me compile my review findings:
Code Review: PR #6194 - Overlap header downloads with state sync
Summary
This PR restructures snap sync to download headers in the background concurrently with state sync, removing headers from the critical path. The approach is sound and the implementation is generally well-done. However, there are several issues that need attention.
Critical Issues
1. Potential Race Condition with current_head_number (snap_sync.rs:296-297)
```rust
current_head = last_block_hash;
current_head_number = last_block_number;
```
Recommendation: Verify that current_head and current_head_number are always updated together so they never describe different blocks.
2. Missing Validation of Pivot Header (snap_sync.rs:124-129)
```rust
let pivot_header = request_pivot_header(peers, sync_head).await?;
info!(
    "Obtained pivot header #{} (hash: {:?})",
    pivot_header.number,
    pivot_header.hash()
);
```
The pivot header fetched by hash is not validated. A malicious peer could return a header that does not correspond to the requested sync_head.
Recommendation: Add validation:
```rust
let pivot_header = request_pivot_header(peers, sync_head).await?;
if pivot_header.hash() != sync_head {
    return Err(SyncError::InvalidPivotHeader);
}
// Add additional consensus validation here
```
3. Inconsistent Error Handling in Background Task (snap_sync.rs:249-251)
```rust
if attempts > MAX_HEADER_FETCH_ATTEMPTS {
    warn!("Background header download: failed to find target, aborting");
    return Ok(()); // Returns Ok even on failure!
}
```
The background task returns Ok(()) even when it fails to find the target header, so the caller cannot distinguish failure from success.
Recommendation: Return an error instead:
```rust
if attempts > MAX_HEADER_FETCH_ATTEMPTS {
    return Err(SyncError::NoBlockHeaders);
}
```
Major Issues
4. PeerHandler Cloning May Be Expensive (snap_sync.rs:159)
```rust
let mut bg_peers = peers.clone();
```
Recommendation: Consider using a shared handle for the background task instead of cloning the full PeerHandler.
5. Potential Memory Issue: Unbounded block_hashes Vector (snap_sync.rs:54, 304)
```rust
pub struct SnapBlockSyncState {
    pub block_hashes: Vec<H256>,
    store: Store,
}
```
For mainnet (24.4M headers), block_hashes grows to tens of millions of H256 entries (roughly 24.4M × 32 bytes ≈ 780 MB).
Recommendation: Consider paginating or using a more memory-efficient structure, or at least document this memory requirement.
6. Missing Cancellation Mechanism (snap_sync.rs:162-173)
The background task has no way to be cancelled if the main sync encounters an error or needs to abort. If the main sync fails, the spawned task keeps running.
Recommendation: Use a CancellationToken:
```rust
let cancel_token = CancellationToken::new();
let header_handle = {
    let token = cancel_token.clone();
    tokio::spawn(async move {
        tokio::select! {
            result = download_headers_background(...) => result,
            _ = token.cancelled() => Ok(()),
        }
    })
};
```
Minor Issues
7. Duplicate Log Message (snap_sync.rs:309, 643)
The message "Background header download complete" appears twice - once from the background task and once in snap_sync.
8. Unclear Variable Naming (snap_sync.rs:150-154)
```rust
let mut block_sync_state = block_sync_state;
block_sync_state.block_hashes.push(pivot_header.hash());
let block_sync_state = Arc::new(tokio::sync::Mutex::new(block_sync_state));
```
The repeated shadowing of block_sync_state makes the flow hard to follow.
Recommendation:
```rust
let mut block_sync_state = block_sync_state;
block_sync_state.block_hashes.push(pivot_header.hash());
let shared_block_sync_state = Arc::new(tokio::sync::Mutex::new(block_sync_state));
```
9. Missing Error Context in peer_handler.rs (peer_handler.rs:652-654)
```rust
if let Some(header) = block_headers.into_iter().next() {
    return Ok(Some(header));
}
```
When a peer returns an empty header list, the failure is silent and gives no context for debugging.
Recommendation: Add logging:
```rust
if let Some(header) = block_headers.into_iter().next() {
    return Ok(Some(header));
} else {
    debug!("Peer returned empty headers for hash {block_hash:?}");
}
```
10. Potential Off-by-One in Header Download (snap_sync.rs:299-300)
```rust
if block_headers.len() > 1 {
    let block_headers_iter = block_headers.into_iter().skip(1);
```
The code skips the first header assuming it's already stored, but in the background download flow that assumption may not always hold. Need to verify that the skipped header is in fact already present before discarding it.
Performance Considerations
11. Lock Contention on block_sync_state (snap_sync.rs:301-305, 757-761)
The background header download and update_pivot both acquire the block_sync_state lock. This could cause contention if both try to lock simultaneously. Consider keeping the critical sections short and avoiding awaits while the lock is held.
Code Quality
12. Inconsistent Comment Style (snap_sync.rs:280, 287)
Some comments use differing styles.
13. Magic Number in Retry Logic (snap_sync.rs:214-216)
```rust
tokio::time::sleep(Duration::from_millis(
    1.1_f64.powf(attempts as f64) as u64,
))
```
The exponential backoff constant 1.1 is a magic number; consider extracting it into a named constant.
Testing Recommendations
Positive Aspects
✅ Clear separation of concerns with dedicated functions
Recommendation
Request Changes - The critical issues (#2 and #3) must be fixed before merge. Issue #4 should be carefully evaluated as it could cause peer management problems in production.
Automated review by Claude (Anthropic) · custom prompt
Greptile Overview
Greptile Summary
This PR restructures snap sync to download headers in a background task concurrently with state sync, eliminating 15-20 minutes from the critical path on mainnet. Key changes:
Critical issue found: checkpoint restoration updates current_head but not current_head_number (see the file table and inline comment below).
Confidence Score: 2/5
| Filename | Overview |
|---|---|
| crates/networking/p2p/sync/snap_sync.rs | Refactored to overlap header downloads with state sync using background task and Arc<Mutex<>>. Critical bug: checkpoint restoration doesn't update block number (line 238-240). |
| crates/networking/p2p/peer_handler.rs | Added get_block_header_by_hash method to fetch single header by hash. Clean implementation, follows existing patterns. |
| crates/networking/p2p/snap/client.rs | Updated request_account_range signature to accept Arc<Mutex> for concurrent access. Minimal change, correct. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Main as sync_cycle_snap
    participant Pivot as request_pivot_header
    participant BG as download_headers_background
    participant State as snap_sync (state download)
    participant Store as Store
    participant Peers as PeerHandler
    Main->>Store: get current_head
    Main->>Pivot: Fetch pivot header by hash
    Pivot->>Peers: get_block_header_by_hash(sync_head)
    Peers-->>Pivot: pivot_header
    Pivot-->>Main: pivot_header
    Main->>Store: add_block_headers([pivot_header])
    Main->>Main: Wrap block_sync_state in Arc<Mutex<>>
    par Background Header Download
        Main->>BG: tokio::spawn(download_headers_background)
        BG->>Store: get_header_download_checkpoint()
        loop Until sync_head found
            BG->>Peers: request_block_headers(current_head_number, sync_head)
            Peers-->>BG: block_headers batch
            BG->>Store: process_incoming_headers (via lock)
        end
        BG-->>Main: header_handle (awaited later)
    and Concurrent State Sync
        Main->>State: snap_sync(header_download_handle)
        loop State download & healing
            State->>Peers: request_account_range
            State->>Peers: request_storage_ranges
            State->>Peers: request_bytecodes
            Note over State: May call update_pivot which locks block_sync_state
        end
        State->>State: Await header_download_handle
        State->>Store: store_block_bodies
        State->>Store: forkchoice_update
    end
```
3 files reviewed, 1 comment
```rust
if let Some(head) = store.get_header_download_checkpoint().await? {
    current_head = head;
}
```
current_head_number not updated after checkpoint restoration
When the checkpoint is restored, current_head (hash) is updated but current_head_number remains at the original value. This causes request_block_headers(current_head_number, sync_head) on line 246 to restart from the wrong block number.
Suggested change:
```diff
 if let Some(head) = store.get_header_download_checkpoint().await? {
     current_head = head;
+    current_head_number = store.get_block_number(head).await?.ok_or(SyncError::BlockNumber(head))?;
 }
```
Pull request overview
This PR restructures L1 snap sync to overlap header downloads with state sync, aiming to remove 15–20 minutes of sequential header download time from the critical path (mainnet).
Changes:
- Fetches the pivot header directly by hash before starting state sync.
- Spawns a Tokio background task to download the remaining headers concurrently with state sync and awaits it before finalization.
- Updates snap client/state-sync plumbing to share SnapBlockSyncState via Arc<tokio::sync::Mutex<...>>, and adds a peer API for fetching a single header by hash.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| crates/networking/p2p/sync/snap_sync.rs | Refactors snap sync cycle to fetch pivot header upfront, download headers in a background task, and share sync state via Arc<Mutex<...>>. |
| crates/networking/p2p/snap/client.rs | Updates request_account_range signature to accept the shared Arc<Mutex<SnapBlockSyncState>>. |
| crates/networking/p2p/peer_handler.rs | Adds get_block_header_by_hash() to request a single header by hash from a specific peer. |
```rust
// Store the pivot header and add it to block_sync_state
store
    .add_block_headers(vec![pivot_header.clone()])
    .await?;
let mut block_sync_state = block_sync_state;
block_sync_state.block_hashes.push(pivot_header.hash());

// Wrap in Arc<Mutex<>> for sharing between background download and state sync
let block_sync_state = Arc::new(tokio::sync::Mutex::new(block_sync_state));
```
Copilot AI · Feb 12, 2026
block_hashes is initialized with the pivot hash and then concurrently extended by the background downloader via process_incoming_headers. Since headers are downloaded forward from current_head_number, appending them after the pivot will (a) reorder the chain incorrectly and (b) likely duplicate the pivot header when the background task reaches sync_head. This will break the numbers_and_hashes computation later (duplicate hashes mapped to the wrong block numbers) and can also cause snap sync to select the wrong pivot if the background task appends before snap_sync reads .last(). Consider keeping pivot_hash separate from block_hashes (or ensuring the background download builds block_hashes in canonical order ending with the pivot and never duplicates it).
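One way to address this, sketched under the assumption that a dedicated field is acceptable; the field name pivot_hash is illustrative, not from the PR:

```rust
// Sketch: keep the pivot out of block_hashes so the background downloader can
// neither reorder nor duplicate it.
pub struct SnapBlockSyncState {
    /// Filled in canonical order by the background header download only.
    pub block_hashes: Vec<H256>,
    /// Set once from sync_head in sync_cycle_snap; read-only afterwards.
    pub pivot_hash: H256,
    store: Store,
}
```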
```rust
let mut current_head = {
    let bss = block_sync_state.lock().await;
    bss.store
        .get_latest_canonical_block_hash()
        .await?
        .unwrap_or(H256::zero())
};
```
Copilot AI · Feb 12, 2026
download_headers_background acquires block_sync_state.lock().await and then performs an async DB call (get_latest_canonical_block_hash().await) while holding the mutex. Holding a Tokio mutex across .await can cause unnecessary contention and can block update_pivot/state sync from making progress. Prefer reading from the store argument (no lock needed) or move the DB await outside the critical section.
Suggested change:
```diff
-let mut current_head = {
-    let bss = block_sync_state.lock().await;
-    bss.store
-        .get_latest_canonical_block_hash()
-        .await?
-        .unwrap_or(H256::zero())
-};
+let mut current_head = store
+    .get_latest_canonical_block_hash()
+    .await?
+    .unwrap_or(H256::zero());
```
```rust
if block_headers.len() > 1 {
    let block_headers_iter = block_headers.into_iter().skip(1);

    block_sync_state
        .lock()
        .await
        .process_incoming_headers(block_headers_iter)
        .await?;
}
```
Copilot AI · Feb 12, 2026
process_incoming_headers(...).await is executed while holding block_sync_state.lock().await. Since process_incoming_headers awaits on multiple store writes, this mutex can be held for a long time, blocking concurrent pivot updates and any reads of block_hashes. Consider restructuring so the mutex is only held for in-memory updates (e.g., compute headers/hashes first, write headers/checkpoint to store outside the lock, then briefly lock to update block_hashes).
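A sketch of that restructuring, assuming the work can be split into store writes and an in-memory update; the checkpoint method name is assumed:

```rust
// Compute everything that needs no shared state first.
let headers: Vec<BlockHeader> = block_headers.into_iter().skip(1).collect();
let hashes: Vec<H256> = headers.iter().map(|h| h.hash()).collect();

// Store writes happen without holding the shared mutex.
store.add_block_headers(headers).await?;
store.set_header_download_checkpoint(last_block_hash).await?; // name assumed

// The mutex is only held across a synchronous Vec::extend, never across an await.
block_sync_state.lock().await.block_hashes.extend(hashes);
```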
```rust
let pivot_hash = block_sync_state
    .lock()
    .await
    .block_hashes
    .last()
    .copied()
    .ok_or(SyncError::NoBlockHeaders)?;
let mut pivot_header = store
```
Copilot AI · Feb 12, 2026
snap_sync derives pivot_hash from block_sync_state.block_hashes.last(), but block_hashes is mutated concurrently by the background header download task. Depending on timing, .last() can become an intermediate header rather than sync_head, causing state sync to target the wrong pivot. Use a stable pivot identifier (e.g., pass sync_head/pivot_hash explicitly into snap_sync, or store the pivot hash in a dedicated field that the background task does not modify).
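For example, the pivot could be fixed once in sync_cycle_snap and passed down rather than re-read from the shared vector; a sketch, with the real argument order assumed:

```rust
// The pivot is known before the background task starts; freeze it here.
let pivot_hash = pivot_header.hash(); // equals sync_head once validated
snap_sync(peers, &store, &block_sync_state, pivot_hash, datadir, header_handle).await?;
```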
```rust
let bss = block_sync_state.clone();
let mut bg_peers = peers.clone();
let bg_store = store.clone();
let pending_block = store.get_pending_block(sync_head).await.ok().flatten();
```
Copilot AI · Feb 12, 2026
store.get_pending_block(sync_head).await.ok().flatten() drops any storage error and silently treats it as “no pending block”. If the DB read fails, header download will proceed with potentially inconsistent state and the root cause is lost. Propagate the error (or at least log it and fail the sync cycle) instead of swallowing it.
Suggested change:
```diff
-let pending_block = store.get_pending_block(sync_head).await.ok().flatten();
+let pending_block = store.get_pending_block(sync_head).await?;
```
```rust
// Spawn background header download — runs concurrently with state sync
let header_handle = {
    let bss = block_sync_state.clone();
    let mut bg_peers = peers.clone();
    let bg_store = store.clone();
    let pending_block = store.get_pending_block(sync_head).await.ok().flatten();
    tokio::spawn(async move {
        download_headers_background(
            &mut bg_peers,
            &bg_store,
            &bss,
            current_head_number,
            sync_head,
            pending_block.map(|b| b.header),
        )
        .await
    })
};

let mut attempts = 0;
info!("Background header download started — proceeding with state sync");
snap_sync(peers, &store, &block_sync_state, datadir, header_handle).await?;
```
Copilot AI · Feb 12, 2026
If snap_sync(...) returns early with an error before awaiting header_handle, the spawned background header task will keep running and mutating the DB/checkpoints even though the sync cycle failed. Consider adding a cancellation mechanism (e.g., CancellationToken passed into download_headers_background) and aborting/joining the task on error, so failed sync attempts don't leave a stray downloader running.
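A sketch of that wiring with tokio_util::sync::CancellationToken; how the token is threaded into download_headers_background and checked between batches is an assumption:

```rust
use tokio_util::sync::CancellationToken;

let cancel = CancellationToken::new();
let header_handle = tokio::spawn({
    let token = cancel.clone();
    let (mut bg_peers, bg_store, bss) = (peers.clone(), store.clone(), block_sync_state.clone());
    async move {
        // Assumed: the downloader checks token.is_cancelled() between header batches.
        download_headers_background(&mut bg_peers, &bg_store, &bss, sync_head, token).await
    }
});

// If state sync fails before awaiting the handle, stop the downloader explicitly.
if let Err(err) = snap_sync(peers, &store, &block_sync_state, datadir, header_handle).await {
    cancel.cancel();
    return Err(err);
}
Ok(())
```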
```diff
 if attempts > MAX_HEADER_FETCH_ATTEMPTS {
-    warn!("Sync failed to find target block header, aborting");
+    warn!("Background header download: failed to find target, aborting");
     return Ok(());
```
Copilot AI · Feb 12, 2026
When header fetching exceeds MAX_HEADER_FETCH_ATTEMPTS, the background task logs a warning and returns Ok(()). Since snap_sync treats the JoinHandle result as success, finalization may run without a complete header chain. This should return an error (e.g., SyncError::NoBlockHeaders) so the sync cycle fails fast instead of silently proceeding in an invalid state.
Suggested change:
```diff
-    return Ok(());
+    return Err(SyncError::NoBlockHeaders);
```
```diff
 if first_block_hash == last_block_hash
     && first_block_hash == current_head
     && current_head != sync_head
 {
     // There is no path to the sync head this goes back until it find a common ancerstor
-    warn!("Sync failed to find target block header, going back to the previous parent");
+    warn!("Background header download: going back to previous parent");
     current_head = first_block_parent_hash;
     continue;
```
Copilot AI · Feb 12, 2026
In the “no path to sync head” fallback, current_head is set to first_block_parent_hash but current_head_number is not updated. Since the next request is driven by current_head_number (not the hash), this fallback may not actually move the search window and can lead to repeated requests/loops. Use first_block_number to decrement current_head_number (and keep hash/number consistent), or refactor to drive the request by hash instead of number for this recovery path.
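A sketch of keeping the two in step, assuming first_block_number is available from the same response as first_block_parent_hash:

```rust
warn!("Background header download: going back to previous parent");
current_head = first_block_parent_hash;
// Keep the number consistent with the hash: the parent of the first returned block.
current_head_number = first_block_number.saturating_sub(1);
continue;
```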
```rust
if let Some(header) = block_headers.into_iter().next() {
    return Ok(Some(header));
```
Copilot AI · Feb 12, 2026
get_block_header_by_hash returns the first header the peer sends without verifying it matches block_hash. A malicious/buggy peer could return a different header, and snap sync would accept it as the pivot. Validate header.hash() == block_hash (and ideally that exactly one header was returned) before returning Some(header), otherwise treat it as a failure.
Suggested change:
```diff
-if let Some(header) = block_headers.into_iter().next() {
-    return Ok(Some(header));
+if block_headers.len() == 1 {
+    let header = block_headers.into_iter().next().expect("len checked above");
+    if header.hash() == block_hash {
+        return Ok(Some(header));
+    } else {
+        debug!(
+            "Received block header with mismatching hash from peer {peer_id:?}: \
+             expected {block_hash:?}, got {:?}",
+            header.hash()
+        );
+    }
+} else {
+    debug!(
+        "Expected exactly one block header from peer {peer_id:?} for hash \
+         {block_hash:?}, but received {} headers",
+        block_headers.len()
+    );
```
…omote set_normal_mode to error
Headers now download in background concurrently with state sync, so the old sequential Phase 1/8 model is wrong. Changes:
- Remove phase numbers from display (phases can repeat due to healing loops)
- Add headers_downloading AtomicBool flag to track background download state
- Show compact one-liner header progress inline with each state sync update
- Log header completion once when background download finishes
- Split download_headers_background into wrapper + inner to ensure the flag is always cleared even on error paths
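A sketch of the wrapper/inner split, assuming the flag is an Arc<AtomicBool> and with the argument list collapsed into a placeholder:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Sketch: the wrapper owns the headers_downloading flag so every exit path clears it.
async fn download_headers_background(
    headers_downloading: Arc<AtomicBool>,
    args: DownloadArgs, // placeholder for the real argument list
) -> Result<(), SyncError> {
    headers_downloading.store(true, Ordering::Relaxed);
    let result = download_headers_background_inner(args).await;
    // Cleared on success and on error, so the progress display stops reporting
    // header progress once the background task has exited for any reason.
    headers_downloading.store(false, Ordering::Relaxed);
    result
}
```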
Benchmark Block Execution Results Comparison Against Main
```rust
    .add_block_headers(vec![pivot_header.clone()])
    .await?;
let mut block_sync_state = block_sync_state;
block_sync_state.block_hashes.push(pivot_header.hash());
```
The pivot hash is pushed here before spawning the background download. But download_headers_background_inner will also push it when it reaches sync_head — the process_incoming_headers call at line 333 extends block_hashes with all received headers including the sync_head.
This means the pivot hash appears twice in block_hashes. The forkchoice_update at line 684 computes pivot_header.number - i for each entry, so the duplicate shifts all preceding block numbers by one — every header before the pivot gets mapped to the wrong block number.
Fix: either skip pushing here and let the background task handle it, or add a guard in process_incoming_headers / the background loop to skip the sync_head hash if it's already present.
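A sketch of the second option (the guard), assuming the pivot hash is threaded into process_incoming_headers:

```rust
for header in block_headers_iter {
    let hash = header.hash();
    // The pivot (sync_head) hash was already pushed in sync_cycle_snap; skip it here.
    if hash != pivot_hash {
        self.block_hashes.push(hash);
    }
    // ...existing store writes for the header stay unchanged...
}
```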
```rust
let mut current_head = {
    let bss = block_sync_state.lock().await;
    bss.store
        .get_latest_canonical_block_hash()
```
current_head (hash) is re-derived here from get_latest_canonical_block_hash() + checkpoint, but current_head_number was computed by the caller from a different current_head value (the one from block_sync_state.get_current_head()). Between those two reads, sync_cycle_snap stored the pivot header via add_block_headers, which could change what get_latest_canonical_block_hash() returns.
If the re-derived current_head hash doesn't match current_head_number, the first request_block_headers(current_head_number, sync_head) call starts from the wrong position.
Consider passing both current_head and current_head_number from the caller instead of re-deriving the hash.
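A sketch of that signature change; types are simplified and the block number is assumed to be a u64:

```rust
async fn download_headers_background(
    peers: &mut PeerHandler,
    store: &Store,
    block_sync_state: &Arc<tokio::sync::Mutex<SnapBlockSyncState>>,
    current_head: H256,        // passed in by the caller, not re-derived here
    current_head_number: u64,  // computed by the caller from the same block as current_head
    sync_head: H256,
    pending_block_header: Option<BlockHeader>,
) -> Result<(), SyncError> {
    // ...download loop unchanged, driven by the consistent (hash, number) pair...
    todo!()
}
```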
```rust
// and forkchoice_update, which need the full header chain.
info!("Waiting for background header download to complete...");
header_download_handle.await??;
info!("Background header download complete");
```
nit: This duplicates the log at line 338 inside download_headers_background_inner. Since header_download_handle.await?? only succeeds when the background task completed successfully (which already logged "complete"), this second message is redundant. Consider removing one of them or changing this to a debug-level log like debug!("Header download handle joined successfully").
Motivation
ethrex downloads ALL 24.4M headers sequentially (15-20 min on mainnet) before starting state sync. Nethermind runs header download concurrently with state sync, removing headers from the critical path entirely.
Description
Restructure sync_cycle_snap() to overlap header downloads with state sync:
- request_pivot_header() fetches the sync_head header directly via get_block_header_by_hash() — no need to download the full chain first
- download_headers_background() runs as a tokio task, downloading remaining headers concurrently with state sync
- block_sync_state is wrapped in Arc<tokio::sync::Mutex<>> and shared between the background download and state sync (for update_pivot calls)
- snap_sync() awaits the header download handle before store_block_bodies and forkchoice_update, which need the full header chain
On mainnet, state sync takes 30+ minutes — headers (15-20 min) complete well before state sync finishes, making headers essentially "free".
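A sketch of what request_pivot_header could look like, given the pieces described above; the retry count reuses MAX_HEADER_FETCH_ATTEMPTS and the backoff delay is an assumption:

```rust
async fn request_pivot_header(
    peers: &mut PeerHandler,
    sync_head: H256,
) -> Result<BlockHeader, SyncError> {
    for _ in 0..MAX_HEADER_FETCH_ATTEMPTS {
        if let Some(header) = peers.get_block_header_by_hash(sync_head).await? {
            // Defensive check raised in review: only accept the header we asked for.
            if header.hash() == sync_head {
                return Ok(header);
            }
        }
        // Assumed backoff between attempts.
        tokio::time::sleep(std::time::Duration::from_millis(500)).await;
    }
    Err(SyncError::NoBlockHeaders)
}
```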
Files changed
- snap_sync.rs: Restructured sync_cycle_snap, new request_pivot_header, new download_headers_background, updated snap_sync and update_pivot signatures
- peer_handler.rs: Added get_block_header_by_hash() method
- snap/client.rs: Updated request_account_range to accept Arc<Mutex<SnapBlockSyncState>>
How to Test