perf(l1): parallelize header download with state download during snap sync #6059
base: main
Conversation
…tory Reorganize state_healing.rs and storage_healing.rs into a shared sync/healing/ module structure with clearer naming conventions:
- Create sync/healing/ directory with mod.rs, types.rs, state.rs, storage.rs
- Rename MembatchEntryValue to HealingQueueEntry
- Rename MembatchEntry to StorageHealingQueueEntry
- Rename Membatch type to StorageHealingQueue
- Rename children_not_in_storage_count to missing_children_count
- Rename membatch variables to healing_queue throughout
- Extract shared HealingQueueEntry and StateHealingQueue types to types.rs
- Update sync.rs imports to use new healing module
Reorganize snap protocol code for better maintainability:
- Split rlpx/snap.rs into rlpx/snap/ directory:
  - codec.rs: RLP encoding/decoding for snap messages
  - messages.rs: Snap protocol message types
  - mod.rs: Module re-exports
- Split snap.rs into snap/ directory:
  - constants.rs: Snap sync constants and configuration
  - server.rs: Snap protocol server implementation
  - mod.rs: Module re-exports
- Move snap server tests to dedicated tests/ directory
- Update imports in p2p.rs, peer_handler.rs, and code_collector.rs
Document the phased approach for reorganizing snap sync code:
- Phase 1: rlpx/snap module split
- Phase 2: snap module split with server extraction
- Phase 3: healing module unification
Split the large sync.rs (1631 lines) into focused modules:
- sync/full.rs (~260 lines): Full sync implementation
  - sync_cycle_full(), add_blocks_in_batch(), add_blocks()
- sync/snap_sync.rs (~1100 lines): Snap sync implementation
  - sync_cycle_snap(), snap_sync(), SnapBlockSyncState
  - store_block_bodies(), update_pivot(), block_is_stale()
  - validate_state_root(), validate_storage_root(), validate_bytecodes()
  - insert_accounts(), insert_storages() (both rocksdb and non-rocksdb)
- sync.rs (~285 lines): Orchestration layer
  - Syncer struct with start_sync() and sync_cycle()
  - SyncMode, SyncError, AccountStorageRoots types
  - Re-exports for public API
…p/client.rs Move all snap protocol client-side request methods from peer_handler.rs to a dedicated snap/client.rs module:
- request_account_range and request_account_range_worker
- request_bytecodes
- request_storage_ranges and request_storage_ranges_worker
- request_state_trienodes
- request_storage_trienodes
Also moves related types: DumpError, RequestMetadata, SnapClientError, RequestStateTrieNodesError, RequestStorageTrieNodes. This reduces peer_handler.rs from 2,060 to 670 lines (~68% reduction), leaving it focused on ETH protocol methods (block headers/bodies). Added SnapClientError variant to SyncError for proper error handling. Updated plan_snap_sync.md to mark Phase 4 as complete.
…napError type Implement Phase 5 of snap sync refactoring plan - Error Handling.
- Create snap/error.rs with unified SnapError enum covering all snap protocol errors
- Update server functions (process_account_range_request, process_storage_ranges_request, process_byte_codes_request, process_trie_nodes_request) to return Result<T, SnapError>
- Remove SnapClientError and RequestStateTrieNodesError, consolidate into SnapError
- Keep RequestStorageTrieNodesError struct for request ID tracking in storage healing
- Add From<SnapError> for PeerConnectionError to support error propagation in message handlers
- Update sync module to use SyncError::Snap variant
- Update healing modules (state.rs, storage.rs) to use new error types
- Move DumpError struct to error.rs module
- Update test return types to use SnapError
- Mark Phase 5 as completed in plan document
All phases of the snap sync refactoring are now complete.
Change missing_children_count from u64 to usize in HealingQueueEntry and node_missing_children function to match StorageHealingQueueEntry and be consistent with memory structure counting conventions.
Resolve conflicts from #5977 and #6018 merge to main:
- Keep modular sync structure (sync.rs delegates to full.rs and snap_sync.rs)
- Keep snap client code in snap/client.rs (removed from peer_handler.rs)
- Add InsertingAccountRanges metric from #6018 to snap_sync.rs
- Remove unused info import from peer_handler.rs
Previously, sync_cycle_snap() downloaded ALL block headers before starting state download. This caused unnecessary delays, since state download is independent of header download and pivot selection only needs recent headers. This change moves header downloading to a background task that runs in parallel with state download:
- Add header_receiver and download_complete fields to SnapBlockSyncState
- Create download_headers_background() function that runs header download in a separate tokio task and sends headers through an mpsc channel
- Modify sync_cycle_snap() to spawn the background task and start state download immediately
- Add process_pending_headers() helper to incrementally process headers at strategic points during snap_sync()
- snap_sync() waits for initial headers only when needed for pivot selection
This allows state download to begin much sooner, potentially reducing overall sync time significantly for mainnet syncs.
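To make the described flow concrete, here is a minimal, self-contained sketch of the pattern rather than the ethrex code itself: a spawned tokio task streams header batches through an mpsc channel and flips an AtomicBool when it finishes, while the main task drains pending batches without blocking. The BlockHeader stand-in, fetch_next_header_batch, and the loop bounds are illustrative assumptions.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use std::time::Duration;
use tokio::sync::mpsc;

// Stand-in for the real header type.
#[derive(Debug, Clone)]
struct BlockHeader {
    number: u64,
}

// Placeholder for the peer request; the real code asks peers for header batches.
async fn fetch_next_header_batch(start: u64) -> Vec<BlockHeader> {
    (start..start + 3).map(|number| BlockHeader { number }).collect()
}

// Background task: keeps downloading headers and pushes batches into the channel.
async fn download_headers_background(
    sender: mpsc::Sender<Vec<BlockHeader>>,
    download_complete: Arc<AtomicBool>,
    sync_head: u64,
) {
    let mut next = 0;
    while next <= sync_head {
        let batch = fetch_next_header_batch(next).await;
        next = batch.last().map(|h| h.number + 1).unwrap_or(sync_head + 1);
        // If the receiver is gone, state download ended; just stop.
        if sender.send(batch).await.is_err() {
            break;
        }
    }
    // Signal the main task that no more headers will arrive.
    download_complete.store(true, Ordering::Release);
}

// Main task helper: drain whatever headers are already available without blocking.
fn process_pending_headers(
    receiver: &mut mpsc::Receiver<Vec<BlockHeader>>,
    processed: &mut Vec<BlockHeader>,
) {
    while let Ok(batch) = receiver.try_recv() {
        processed.extend(batch);
    }
}

#[tokio::main]
async fn main() {
    let (sender, mut receiver) = mpsc::channel(100);
    let download_complete = Arc::new(AtomicBool::new(false));

    tokio::spawn(download_headers_background(sender, download_complete.clone(), 10));

    // State download would run here; at strategic points it drains pending headers.
    let mut headers = Vec::new();
    while !download_complete.load(Ordering::Acquire) {
        process_pending_headers(&mut receiver, &mut headers);
        tokio::time::sleep(Duration::from_millis(10)).await;
    }
    // Final drain after the background task has finished.
    process_pending_headers(&mut receiver, &mut headers);
    println!("processed {} headers", headers.len());
}
```

In the real implementation the batches come from peer requests and the processed headers feed pivot selection and forkchoice updates; the sketch only shows the channel-and-flag choreography.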
Greptile Overview
Greptile Summary: This PR successfully parallelizes header download with state download during snap sync by introducing a background task pattern with channel-based communication.
Confidence Score: 3.5/5
| Filename | Overview |
|---|---|
| crates/networking/p2p/sync/snap_sync.rs | Refactors header download to run in background task parallel with state download, with proper channel-based synchronization |
Sequence Diagram
sequenceDiagram
participant Main as sync_cycle_snap (main task)
participant BG as download_headers_background (spawned task)
participant Channel as mpsc::channel
participant State as snap_sync (state download)
Main->>Main: Initialize block_sync_state
Main->>Channel: Create channel (buffer=100)
Main->>Main: Clone peers, store, blockchain
Main->>BG: Spawn background task with clones
activate BG
Main->>State: Call snap_sync immediately
activate State
par Header Download
loop Until sync_head found
BG->>BG: Request block headers
BG->>BG: Process & filter headers
alt Headers available
BG->>Channel: Send headers (non-blocking)
end
alt Sync head found or full sync needed
BG->>BG: Set download_complete flag
BG->>BG: Optionally switch to full sync
end
end
BG->>BG: Set download_complete = true
deactivate BG
and State Download
State->>State: Wait for initial headers (busy loop)
State->>Channel: try_receive_headers
State->>State: Select pivot header
loop Account & Storage Download
State->>State: request_account_range
State->>State: insert_accounts
State->>Channel: process_pending_headers (non-blocking)
State->>State: request_storage_ranges
State->>Channel: process_pending_headers (non-blocking)
State->>State: insert_storages
end
loop Healing
State->>State: heal_state_trie_wrap
State->>State: heal_storage_trie
State->>Channel: process_pending_headers (non-blocking)
end
State->>Channel: Final process_pending_headers
State-->>Main: Return
deactivate State
end
Main->>BG: Await background task completion
Main->>Main: Sync snap_enabled state from Arc
Main->>Main: Clear snap state & return
1 file reviewed, 6 comments
);

// Create channel for header communication between background task and main snap_sync
let (header_sender, header_receiver) = tokio::sync::mpsc::channel(100);
Channel buffer size of 100 could cause backpressure if header download significantly outpaces state download. Consider making buffer size configurable or calculating based on expected header batch sizes
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
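A minimal sketch of the named-constant variant the reviewer hints at; HEADER_CHANNEL_BUFFER is a hypothetical name and the PR itself hard-codes 100 at the call site.

```rust
use tokio::sync::mpsc;

// Hypothetical constant; neither the name nor a tuned value comes from the PR.
const HEADER_CHANNEL_BUFFER: usize = 100;

fn main() {
    // Creating the channel does not require a running tokio runtime.
    let (_header_sender, _header_receiver) = mpsc::channel::<Vec<u64>>(HEADER_CHANNEL_BUFFER);
    println!("header channel buffer = {HEADER_CHANNEL_BUFFER}");
}
```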
self.download_complete
    .as_ref()
    .map(|flag| flag.load(Ordering::Acquire))
    .unwrap_or(true)
}
Using Ordering::Acquire here is correct for reading the flag. However, consider the semantics: when download_complete is None, returning true means "download is complete", which could be misleading since no download was started. Consider documenting this behavior or returning a different value.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
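A sketch of one way to make the semantics explicit, using a hypothetical DownloadStatus enum instead of Option&lt;Arc&lt;AtomicBool&gt;&gt;; the names and shape are assumptions, not the merged code.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

/// Hypothetical replacement for `Option<Arc<AtomicBool>>` that names the
/// "no background download was started" case explicitly.
enum DownloadStatus {
    /// No background header download was spawned for this sync cycle.
    NotStarted,
    /// A background download is running; the flag flips to true when it finishes.
    Running(Arc<AtomicBool>),
}

impl DownloadStatus {
    /// Returns true when no more headers will arrive, either because the
    /// background task finished or because none was ever started.
    fn is_terminal(&self) -> bool {
        match self {
            DownloadStatus::NotStarted => true,
            DownloadStatus::Running(flag) => flag.load(Ordering::Acquire),
        }
    }
}

fn main() {
    let running = DownloadStatus::Running(Arc::new(AtomicBool::new(false)));
    assert!(!running.is_terminal());
    assert!(DownloadStatus::NotStarted.is_terminal());
    println!("status checks passed");
}
```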
while block_sync_state.block_hashes.is_empty() {
    if block_sync_state.is_download_complete() {
        return Err(SyncError::NoBlockHeaders);
    }
    // Process any available headers
    process_pending_headers(block_sync_state).await?;
    if block_sync_state.block_hashes.is_empty() {
        debug!("Waiting for initial headers from background task...");
        tokio::time::sleep(Duration::from_millis(100)).await;
    }
}
This busy-wait loop with 100ms sleeps could be inefficient. Consider using a blocking receive with timeout or a condvar-based approach for the initial header wait
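A sketch of the timeout-based alternative the comment suggests, using tokio::time::timeout around Receiver::recv(); the function name and the simplified u64 header type are illustrative assumptions.

```rust
use std::time::Duration;
use tokio::sync::mpsc;

// Wait for the first header batch without busy-looping: block on `recv()`
// with a timeout so the caller can still check completion flags periodically.
async fn wait_for_initial_headers(
    receiver: &mut mpsc::Receiver<Vec<u64>>,
    timeout: Duration,
) -> Option<Vec<u64>> {
    match tokio::time::timeout(timeout, receiver.recv()).await {
        Ok(Some(batch)) => Some(batch), // headers arrived
        Ok(None) => None,               // channel closed: sender task is gone
        Err(_elapsed) => None,          // timed out: caller re-checks state and retries
    }
}

#[tokio::main]
async fn main() {
    let (sender, mut receiver) = mpsc::channel(8);
    sender.send(vec![1u64, 2, 3]).await.unwrap();
    let batch = wait_for_initial_headers(&mut receiver, Duration::from_millis(100)).await;
    println!("received: {batch:?}");
}
```

Treating the closed-channel case as terminal in the caller would also avoid spinning if the sender task exits early.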
Kimi AI Code Review
This PR introduces a significant change to the synchronization process of the Ethereum client by parallelizing the header download with the state download during snap sync. The detailed comments and suggestions covered: code correctness and potential bugs, security vulnerabilities, performance implications, Rust best practices and idiomatic patterns, memory safety and proper error handling, code readability and maintainability, and Ethereum-specific considerations.
Overall, the PR introduces a significant improvement to the sync process by parallelizing header and state downloads. However, it is crucial to thoroughly test the changes, especially for potential security vulnerabilities, performance implications, and adherence to Ethereum consensus rules.
Automated review by Kimi (Moonshot AI)
The verbose debug logging added for diagnosing block body download failures was causing excessive memory usage, leading to the process being killed with exit code 137 (OOM) after 29 minutes. Removes all [HEADER_DEBUG], [BODY_DEBUG], [FULLSYNC_DEBUG], and [SNAP_TO_FULL_DEBUG] log statements.
Kimi AI Code Review
The review again covered code correctness and potential bugs, security vulnerabilities, performance implications, Rust best practices and idiomatic patterns, memory safety and proper error handling, code readability and maintainability, and Ethereum-specific considerations.
Conclusion: The PR introduces a significant improvement by parallelizing header and state download during snap sync. However, it is crucial to ensure that the implementation is correct, secure, and maintainable. The suggestions provided aim to address potential issues and improve the overall quality of the code.
Automated review by Kimi (Moonshot AI)
in the multisync monitoring script (docker_monitor.py). The sync completion logs already contain per-phase completion markers (e.g. "✓ BLOCK HEADERS complete: 25,693,009 headers in 0:29:00") but this data was not surfaced in the Slack messages or run summaries. This adds a parse_phase_timings() function that reads saved container logs and extracts timing, count, and duration for all 8 snap sync phases: Block Headers, Account Ranges, Account Insertion, Storage Ranges, Storage Insertion, State Healing, Storage Healing, and Bytecodes. The breakdown is appended to both the Slack notification (as a code block per network instance) and the text-based run log (run_history.log and per-run summary.txt). When a phase did not complete (e.g. on a failed run), it is simply omitted from the breakdown.
⏺ Here's the timeline for Run #99 (PR #6059 - feature/background-header-download). The snap sync itself completed in 5h 25m (within normal range — previous runs were 4h
Comparison with Successful Runs: In successful runs #92-#98, after snap sync the node caught up and started processing new
handling to fix clippy lint error
proof conversion helpers (encodable_to_proof, proof_to_encodable) from server.rs to mod.rs since they are shared utilities used by both client and server.
peer selection, extract magic numbers into named constants (ACCOUNT_RANGE_CHUNK_COUNT, STORAGE_BATCH_SIZE, HASH_MAX), remove unused contents field from DumpError and use derive Debug, rename missing_children to pending_children in healing code, and wrap process_byte_codes_request in spawn_blocking for consistency with other server handlers.
…functions so they no longer depend on the PeerHandler type as a method receiver. The three methods that used self (request_account_range, request_bytecodes, request_storage_ranges) now take peers: &mut PeerHandler as a parameter, and the two already-static methods (request_state_trienodes, request_storage_trienodes) simply move out of the impl block. Callers in snap_sync.rs, healing/state.rs, and healing/storage.rs are updated accordingly.
…ation' into feature/background-header-download
Pull request overview
This PR aims to improve L1 snap sync performance by downloading block headers concurrently with snap state download, so state sync can start immediately instead of waiting for all headers.
Changes:
- Introduces a background header-download task and incremental header processing during snap_sync().
- Adds snap protocol server tests for Hive AccountRange vectors.
- Minor refactor in PeerHandler::request_block_bodies_inner and updates the changelog.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| crates/networking/p2p/sync/snap_sync.rs | Runs header download in a background task and incrementally processes headers during snap state sync. |
| crates/networking/p2p/tests/snap_server_tests.rs | Adds Hive-derived AccountRange tests and a local in-memory state fixture. |
| crates/networking/p2p/peer_handler.rs | Small refactor to store the request result before pattern matching. |
| CHANGELOG.md | Records the perf change entry. |
return super::full::sync_cycle_full(
    &mut peers,  // changed from: peers,
    blockchain,
    tokio_util::sync::CancellationToken::new(),
    sync_head,
Copilot AI (Feb 9, 2026)
download_headers_background() triggers full::sync_cycle_full() inside the background header task. That causes full sync to run concurrently with the ongoing snap_sync() in the main task (both touching the store and consuming peers), and it also contradicts the comment that the main task will handle the switch. Instead, the background task should only signal the mode switch (e.g., set the flag + mark download complete) and return, letting sync_cycle_snap() abort snap sync and then run full sync in a single place.
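A sketch of the signal-only split this comment proposes: the background task records the decision in shared flags and returns, and the caller performs the mode switch in exactly one place. Function names here are illustrative, not from the codebase.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

// Background task side: do not run full sync here, only record the decision.
fn signal_full_sync_switch(snap_enabled: &Arc<AtomicBool>, download_complete: &Arc<AtomicBool>) {
    snap_enabled.store(false, Ordering::Release);
    download_complete.store(true, Ordering::Release);
}

// Caller side (sync_cycle_snap): a single place decides what runs next.
fn finish_cycle(snap_enabled: &Arc<AtomicBool>) {
    if snap_enabled.load(Ordering::Acquire) {
        println!("snap sync finished normally");
    } else {
        println!("background task requested a switch: run full sync here");
    }
}

fn main() {
    let snap_enabled = Arc::new(AtomicBool::new(true));
    let download_complete = Arc::new(AtomicBool::new(false));
    signal_full_sync_switch(&snap_enabled, &download_complete);
    finish_cycle(&snap_enabled);
}
```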
// Handle reorg case where sync head is not reachable
let current_head = store
    .get_block_header(current_head_number)?
    .map(|h| h.hash())
    .unwrap_or(first_block_hash);
Copilot AI (Feb 9, 2026)
The reorg/stuck detection uses store.get_block_header(current_head_number) to derive current_head, but the background task is no longer persisting downloaded headers to the store. After the first batch, current_head_number will likely be beyond the local canonical head, so this lookup returns None and the check becomes ineffective/misleading. Track the last fetched head hash in a local variable (as the previous code did) or persist headers in the background task before using store-based checks.
Suggested change, replacing:
    // Handle reorg case where sync head is not reachable
    let current_head = store
        .get_block_header(current_head_number)?
        .map(|h| h.hash())
        .unwrap_or(first_block_hash);
with:
    // Handle reorg case where sync head is not reachable.
    // Use the last fetched header as the current head instead of relying on the store,
    // since the background task does not persist headers into the store.
    let current_head = last_block_hash;
// Check if we should abort (full sync triggered)
if should_abort_snap_sync(snap_enabled) {
    info!("Snap sync aborted: switching to full sync");
    return Ok(());
Copilot AI (Feb 9, 2026)
When should_abort_snap_sync() is true, snap_sync() returns Ok(()), and sync_cycle_snap() continues to clear snap state and finish successfully. This makes the mode-switch path ambiguous: if the background task is changed to only signal (and not run full sync itself), the caller will silently skip full sync. Consider returning a distinct status/error from snap_sync() (e.g., "SwitchToFullSync") and have sync_cycle_snap() perform the full sync (and skip clear_snap_state() if full sync is about to run).
Suggested change, replacing:
    return Ok(());
with:
    // Signal the caller via an error so it does not treat this as a
    // successful completion and can perform the appropriate mode switch.
    return Err(SyncError::NoBlockHeaders);
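As an alternative to reusing SyncError::NoBlockHeaders, the distinct-status idea from the comment could look roughly like this; SnapSyncOutcome and its variants are hypothetical, not types from the codebase.

```rust
/// Hypothetical outcome type for snap_sync(), as suggested in the comment above;
/// the real code returns Result<(), SyncError>.
enum SnapSyncOutcome {
    /// Snap sync ran to completion; the caller may clear snap state.
    Completed,
    /// The background task decided to switch to full sync; the caller should
    /// run full sync and skip clearing snap state.
    SwitchToFullSync,
}

fn handle_outcome(outcome: SnapSyncOutcome) {
    match outcome {
        SnapSyncOutcome::Completed => println!("clear snap state and finish"),
        SnapSyncOutcome::SwitchToFullSync => println!("run full sync, keep snap state intact"),
    }
}

fn main() {
    handle_outcome(SnapSyncOutcome::SwitchToFullSync);
}
```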
// So I copied the state from a geth execution of the test suite

// State was trimmed to only the first 100 accounts (as the furthest account used by the tests is account 87)
// If the full 408 account state is needed check out previous commits the PR that added this code
Copilot AI (Feb 9, 2026)
Minor grammar nit in this comment: "check out previous commits the PR" is missing a preposition (e.g., "previous commits of the PR").
Suggested change, replacing:
    // If the full 408 account state is needed check out previous commits the PR that added this code
with:
    // If the full 408 account state is needed, check out previous commits of the PR that added this code
…nc full sync loop After snap sync completes, the node switches to full sync mode. Full sync downloads headers backward from the tip looking for a block that is_canonical_sync recognizes. This check requires entries in CANONICAL_BLOCK_HASHES (number→hash). The background header download task stored headers via add_block_headers, which only wrote to HEADERS and BLOCK_NUMBERS but not CANONICAL_BLOCK_HASHES. This caused full sync to never find a canonical ancestor, looping endlessly with "Sync failed to find target block header" until the 8h timeout. Fix: add CANONICAL_BLOCK_HASHES write to the existing transaction in add_block_headers (which has exactly one caller: snap sync background download).
Attaching the investigation of the bug:
…ng bug The block_hashes field stored hashes without block numbers, so the numbers_and_hashes construction for forkchoice_update inferred block numbers from position (pivot_header.number - i). When background header download and update_pivot interleaved inserts, entries ended up out of order, causing forkchoice_update to write wrong canonical hashes. BTreeMap keyed by block number naturally handles ordering and deduplicates overlapping ranges from update_pivot re-inserting the same block numbers.
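A minimal sketch of why keying by block number fixes the ordering and duplication problem described above; it uses string stand-ins for H256 and is not the actual ethrex code.

```rust
use std::collections::BTreeMap;

fn main() {
    // Simplified: block hashes are just strings here instead of H256.
    let mut block_hashes: BTreeMap<u64, &str> = BTreeMap::new();

    // Background header download inserts a range...
    block_hashes.extend([(100, "0xaaa"), (101, "0xbbb"), (102, "0xccc")]);
    // ...and update_pivot later re-inserts an overlapping range out of order.
    block_hashes.extend([(102, "0xccc"), (103, "0xddd"), (101, "0xbbb")]);

    // Iteration is always sorted by block number and duplicates collapse,
    // so numbers_and_hashes can be built directly from the map entries.
    let numbers_and_hashes: Vec<(u64, &str)> =
        block_hashes.iter().map(|(n, h)| (*n, *h)).collect();

    assert_eq!(
        numbers_and_hashes,
        vec![(100, "0xaaa"), (101, "0xbbb"), (102, "0xccc"), (103, "0xddd")]
    );
    println!("{numbers_and_hashes:?}");
}
```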
}

snap_sync(peers, &store, &mut block_sync_state, datadir).await?;
download_complete.store(true, Ordering::Release);
Bug: download_complete is only set to true in the happy path (here) and in two specific early-returns (max attempts, full-sync switch). But if the function returns via ? error propagation (e.g., peers.request_block_headers(...) fails, store.add_block_headers(...) fails, etc.), download_complete is never set to true.
When this happens, the sender half of the channel is dropped (task exits), but snap_sync()'s initial-headers wait loop (line ~435) checks download_status().is_terminal() which returns false (flag still false), so it keeps calling recv_headers_timeout() → gets None (closed channel) → loops with 100ms sleeps forever.
Suggestion: use an RAII guard pattern to ensure download_complete is always set on exit:
struct DownloadCompleteGuard(Arc<AtomicBool>);

impl Drop for DownloadCompleteGuard {
    fn drop(&mut self) {
        self.0.store(true, Ordering::Release);
    }
}

// At the start of download_headers_background:
let _guard = DownloadCompleteGuard(download_complete.clone());

Then remove the manual download_complete.store(true, ...) calls.
);
// We can't easily go back in the background task, so we just continue with the current head
// The update_pivot mechanism will handle this case
tokio::time::sleep(Duration::from_millis(100)).await;
The original code handled the reorg case by walking backward: current_head = first_block_parent_hash. The background task can't do this because current_head was removed in favor of current_head_number, and going back requires knowing the parent hash.
But the current code just sleeps 100ms and retries with the same current_head_number, which will get the same headers back and loop indefinitely. The comment says "The update_pivot mechanism will handle this case" but update_pivot runs in snap_sync() on the main task — it doesn't feed back into the background task's current_head_number.
Consider either:
- Decrement current_head_number by 1 to walk backward (simpler, analogous to the old parent_hash approach; a minimal sketch follows below)
- Add a maximum retry count for this specific case
- Use _first_block_parent_hash (currently prefixed with _) to look up the parent's block number
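A minimal sketch of the first option, stepping the requested head back one block with a bounded retry count; the constant and function names are hypothetical, not from the PR.

```rust
// Illustrative only: walk the requested head backwards when the returned range
// looks like a reorg, giving up after a bounded number of attempts.
const MAX_REORG_BACKOFF: u64 = 64; // hypothetical cap, not from the PR

fn next_request_head(current_head_number: u64, reorg_attempts: u64) -> Option<u64> {
    if reorg_attempts >= MAX_REORG_BACKOFF {
        return None; // give up instead of retrying the same head forever
    }
    // Step one block back, analogous to the old `parent_hash` walk.
    Some(current_head_number.saturating_sub(1))
}

fn main() {
    let mut head = 1_000u64;
    let mut attempts = 0;
    while let Some(next) = next_request_head(head, attempts) {
        attempts += 1;
        head = next;
        if attempts >= 3 {
            break; // in the real task this would retry the header request instead
        }
    }
    println!("would retry from block {head} after {attempts} attempts");
}
```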
    .process_incoming_headers(block_headers_iter)
    .await?;
// Send headers through channel (skip the first as we already have it)
if block_headers.len() > 1 && header_sender.send(block_headers).await.is_err() {
The first header is skipped by the receiver in process_pending_headers() (via .skip(1)), but here the full batch including the first header is sent. This means every batch sent through the channel carries one header that will be discarded. Not a bug, but it's a subtle contract — the sender and receiver must agree on the "skip first" convention.
More importantly: the original code only skipped the first header of the first batch (to avoid the already-known current head). Here, .skip(1) is applied to every batch received from the channel. If the background task sends multiple batches, the first header of each batch (which is the continuation point from the previous batch's last header) gets silently dropped. This could cause gaps in block_hashes — missing one header per batch.
Double-check that the header fetching always returns overlapping ranges (i.e., the first header of batch N+1 == the last header of batch N). If so, the skip is correct. If not, this introduces gaps.
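A small model of the contract being questioned: with overlapping batches, skipping the first element of every batch is lossless, while non-overlapping batches silently lose one header per batch. The numeric headers are stand-ins for real block headers.

```rust
// Simplified model of the sender/receiver contract: each batch must start with
// the last header of the previous batch for `.skip(1)` on the receiver to be safe.
fn receive_batches(batches: Vec<Vec<u64>>) -> Vec<u64> {
    let mut headers = Vec::new();
    for batch in batches {
        // Receiver side: drop the first header of every batch (assumed already known).
        headers.extend(batch.into_iter().skip(1));
    }
    headers
}

fn main() {
    // Overlapping batches: first element of batch N+1 repeats the last of batch N.
    let overlapping = vec![vec![0, 1, 2, 3], vec![3, 4, 5], vec![5, 6]];
    assert_eq!(receive_batches(overlapping), vec![1, 2, 3, 4, 5, 6]);

    // Non-overlapping batches: the same skip silently drops one header per batch.
    let non_overlapping = vec![vec![0, 1, 2, 3], vec![4, 5], vec![6, 7]];
    assert_eq!(receive_batches(non_overlapping), vec![1, 2, 3, 5, 7]);
    println!("skip-first contract demonstrated");
}
```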
block_sync_state.set_header_channel(header_receiver, download_complete.clone());

// Create Arc wrapper for snap_enabled so we can share it with the background task
let snap_enabled_arc = Arc::new(AtomicBool::new(snap_enabled.load(Ordering::Relaxed)));
nit: Creating a separate Arc<AtomicBool> copy of snap_enabled means the Syncer's original snap_enabled (shared with SyncManager) doesn't see the full-sync switch until snap_sync() completes and the sync-back on line ~389 runs. If SyncManager or other code checks snap_enabled concurrently during the sync cycle, it'll still see true even though the background task decided to switch to full sync.
In practice this is probably fine since the sync cycle is the only writer, but it's worth a comment explaining why a copy is used instead of sharing the original Arc.
Summary
- Download block headers in a background task, in parallel with snap state download
- Replace block_hashes: Vec<H256> with BTreeMap<u64, H256>

Motivation
Previously, sync_cycle_snap() downloaded ALL block headers before starting state download. This caused unnecessary delays, since state download is independent of header download and pivot selection only needs recent headers.
Additionally, the block_hashes: Vec<H256> field stored hashes without block numbers. The numbers_and_hashes construction for forkchoice_update inferred block numbers from position (pivot_header.number - i). When the background header download and update_pivot interleaved inserts, entries ended up out of order, causing forkchoice_update to write wrong canonical hashes.

Implementation
- Add header_receiver and download_complete fields to SnapBlockSyncState
- Create a download_headers_background() function that runs header download in a separate tokio task
- Modify sync_cycle_snap() to spawn the background task and start state download immediately
- Add a process_pending_headers() helper to incrementally process headers at strategic points
- snap_sync() waits for initial headers only when needed for pivot selection
- Replace block_hashes: Vec<H256> with BTreeMap<u64, H256> keyed by block number, which naturally handles ordering and deduplicates overlapping ranges from update_pivot re-inserting the same block numbers
- Build numbers_and_hashes directly from the BTreeMap keys instead of inferring from position

Test plan