
Conversation

Contributor

@pablodeymo pablodeymo commented Jan 28, 2026

Summary

  • Moves header downloading to a background task that runs in parallel with state download
  • State download can now begin immediately instead of waiting for all headers to be downloaded
  • Adds incremental header processing at strategic points during snap_sync()
  • Fixes block_hashes ordering bug by replacing Vec<H256> with BTreeMap<u64, H256>

Motivation

Previously, sync_cycle_snap() downloaded ALL block headers before starting state download. This caused unnecessary delays since:

  1. State download is independent of header download
  2. Pivot selection only needs recent headers, not all of them
  3. For mainnet with millions of blocks, header download alone takes significant time

Additionally, the block_hashes: Vec<H256> field stored hashes without block numbers. The numbers_and_hashes construction for forkchoice_update inferred block numbers from position (pivot_header.number - i). When the background header download and update_pivot interleaved inserts, entries ended up out of order, causing forkchoice_update to write wrong canonical hashes.
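A minimal runnable sketch of the fix (H256 here is a stand-in alias, and the block numbers are invented for illustration): keying by block number makes ordering independent of insertion order, and re-inserting an overlapping range is a harmless overwrite.

use std::collections::BTreeMap;

type H256 = [u8; 32]; // stand-in for the real hash type

fn main() {
    let mut block_hashes: BTreeMap<u64, H256> = BTreeMap::new();

    // Background header download and update_pivot may interleave inserts
    // out of order and with overlapping ranges; the map keeps exactly one
    // entry per block number, in ascending key order.
    block_hashes.insert(102, [2u8; 32]);
    block_hashes.insert(100, [0u8; 32]);
    block_hashes.insert(101, [1u8; 32]);
    block_hashes.insert(102, [2u8; 32]); // overlapping re-insert: no duplicate

    // numbers_and_hashes is built from the keys, never inferred from position.
    let numbers_and_hashes: Vec<(u64, H256)> = block_hashes.into_iter().collect();
    let numbers: Vec<u64> = numbers_and_hashes.iter().map(|(n, _)| *n).collect();
    assert_eq!(numbers, vec![100, 101, 102]);
}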

Implementation

  • Add header_receiver and download_complete fields to SnapBlockSyncState
  • Create download_headers_background() function that runs header download in a separate tokio task
  • Modify sync_cycle_snap() to spawn the background task and start state download immediately
  • Add process_pending_headers() helper to incrementally process headers at strategic points
  • snap_sync() waits for initial headers only when needed for pivot selection
  • Replace block_hashes: Vec<H256> with BTreeMap<u64, H256> keyed by block number, which naturally handles ordering and deduplicates the overlapping ranges that update_pivot re-inserts
  • Build numbers_and_hashes directly from BTreeMap keys instead of inferring from position

Test plan

  • Run snap sync on testnet and verify parallel execution via logs
  • Verify state download starts before header download completes
  • Check memory usage to ensure channel doesn't buffer too many headers
  • Compare sync time before/after the change
  • Verify forkchoice_update writes correct canonical hashes after snap sync

…tory

Reorganize state_healing.rs and storage_healing.rs into a shared
sync/healing/ module structure with clearer naming conventions:

- Create sync/healing/ directory with mod.rs, types.rs, state.rs, storage.rs
- Rename MembatchEntryValue to HealingQueueEntry
- Rename MembatchEntry to StorageHealingQueueEntry
- Rename Membatch type to StorageHealingQueue
- Rename children_not_in_storage_count to missing_children_count
- Rename membatch variables to healing_queue throughout
- Extract shared HealingQueueEntry and StateHealingQueue types to types.rs (see the sketch after this list)
- Update sync.rs imports to use new healing module
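For orientation, a hedged sketch of what the renamed types might look like in types.rs (field names, field types, and container shapes are assumptions, not the actual ethrex definitions; missing_children_count is shown as usize, matching a later commit in this PR):

struct Nibbles(Vec<u8>); // stand-in for the trie path type

// Was MembatchEntryValue; shared by state and storage healing.
struct HealingQueueEntry {
    node_rlp: Vec<u8>,             // trie node waiting on missing children
    parent_path: Nibbles,
    missing_children_count: usize, // was children_not_in_storage_count
}

// Was MembatchEntry (shape assumed).
struct StorageHealingQueueEntry {
    account_path: Nibbles,
    entry: HealingQueueEntry,
}

// Was Membatch.
type StorageHealingQueue = Vec<StorageHealingQueueEntry>;
// The state-side queue over the shared entry type (shape assumed).
type StateHealingQueue = Vec<HealingQueueEntry>;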
Reorganize snap protocol code for better maintainability:

- Split rlpx/snap.rs into rlpx/snap/ directory:
  - codec.rs: RLP encoding/decoding for snap messages
  - messages.rs: Snap protocol message types
  - mod.rs: Module re-exports

- Split snap.rs into snap/ directory:
  - constants.rs: Snap sync constants and configuration
  - server.rs: Snap protocol server implementation
  - mod.rs: Module re-exports

- Move snap server tests to dedicated tests/ directory
- Update imports in p2p.rs, peer_handler.rs, and code_collector.rs
Document the phased approach for reorganizing snap sync code:
- Phase 1: rlpx/snap module split
- Phase 2: snap module split with server extraction
- Phase 3: healing module unification
Split the large sync.rs (1631 lines) into focused modules:

- sync/full.rs (~260 lines): Full sync implementation
  - sync_cycle_full(), add_blocks_in_batch(), add_blocks()

- sync/snap_sync.rs (~1100 lines): Snap sync implementation
  - sync_cycle_snap(), snap_sync(), SnapBlockSyncState
  - store_block_bodies(), update_pivot(), block_is_stale()
  - validate_state_root(), validate_storage_root(), validate_bytecodes()
  - insert_accounts(), insert_storages() (both rocksdb and non-rocksdb)

- sync.rs (~285 lines): Orchestration layer
  - Syncer struct with start_sync() and sync_cycle()
  - SyncMode, SyncError, AccountStorageRoots types
  - Re-exports for public API
…p/client.rs

Move all snap protocol client-side request methods from peer_handler.rs
to a dedicated snap/client.rs module:
- request_account_range and request_account_range_worker
- request_bytecodes
- request_storage_ranges and request_storage_ranges_worker
- request_state_trienodes
- request_storage_trienodes

Also moves related types: DumpError, RequestMetadata, SnapClientError,
RequestStateTrieNodesError, RequestStorageTrieNodes.

This reduces peer_handler.rs from 2,060 to 670 lines (~68% reduction),
leaving it focused on ETH protocol methods (block headers/bodies).

Added SnapClientError variant to SyncError for proper error handling.
Updated plan_snap_sync.md to mark Phase 4 as complete.
…napError type

Implement Phase 5 of snap sync refactoring plan - Error Handling.

- Create snap/error.rs with unified SnapError enum covering all snap protocol errors
- Update server functions (process_account_range_request, process_storage_ranges_request,
  process_byte_codes_request, process_trie_nodes_request) to return Result<T, SnapError>
- Remove SnapClientError and RequestStateTrieNodesError, consolidate into SnapError
- Keep RequestStorageTrieNodesError struct for request ID tracking in storage healing
- Add From<SnapError> for PeerConnectionError to support error propagation in message handlers
- Update sync module to use SyncError::Snap variant
- Update healing modules (state.rs, storage.rs) to use new error types
- Move DumpError struct to error.rs module
- Update test return types to use SnapError
- Mark Phase 5 as completed in plan document

All phases of the snap sync refactoring are now complete.
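As a rough illustration of the Phase 5 shape (the variant names below are invented; only SnapError itself, the From<SnapError> for PeerConnectionError impl, and the SyncError::Snap variant come from the commit message):

// Hypothetical sketch; the real variants live in snap/error.rs.
#[derive(Debug)]
enum SnapError {
    NoPeers,
    InvalidProof,
    Decode(String),
    Store(String),
}

#[derive(Debug)]
struct PeerConnectionError(String); // stand-in for the real type

// Lets message handlers use `?` on snap results.
impl From<SnapError> for PeerConnectionError {
    fn from(err: SnapError) -> Self {
        PeerConnectionError(format!("snap protocol error: {err:?}"))
    }
}

#[derive(Debug)]
enum SyncError {
    Snap(SnapError), // variant used by the sync module
}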
Change missing_children_count from u64 to usize in HealingQueueEntry
and node_missing_children function to match StorageHealingQueueEntry
and be consistent with memory structure counting conventions.
Resolve conflicts from #5977 and #6018 merge to main:
- Keep modular sync structure (sync.rs delegates to full.rs and snap_sync.rs)
- Keep snap client code in snap/client.rs (removed from peer_handler.rs)
- Add InsertingAccountRanges metric from #6018 to snap_sync.rs
- Remove unused info import from peer_handler.rs
Previously, sync_cycle_snap() downloaded ALL block headers before starting
state download. This caused unnecessary delays since state download is
independent of header download and the pivot selection only needs recent
headers.

This change moves header downloading to a background task that runs in
parallel with state download:

- Add header_receiver and download_complete fields to SnapBlockSyncState
- Create download_headers_background() function that runs header download
  in a separate tokio task and sends headers through an mpsc channel
- Modify sync_cycle_snap() to spawn the background task and start state
  download immediately
- Add process_pending_headers() helper to incrementally process headers
  at strategic points during snap_sync()
- snap_sync() waits for initial headers only when needed for pivot selection

This allows state download to begin much sooner, potentially reducing
overall sync time significantly for mainnet syncs.
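The pattern in miniature, as a self-contained sketch (the header type and batch contents are stand-ins; the real functions live in snap_sync.rs): a spawned task streams batches through a bounded mpsc channel and flips an AtomicBool when finished, while the consumer drains the channel without blocking at strategic points.

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::mpsc;

type BlockHeader = u64; // stand-in for the real header type

async fn download_headers_background(
    sender: mpsc::Sender<Vec<BlockHeader>>,
    download_complete: Arc<AtomicBool>,
) {
    for start in (0..10u64).step_by(2) {
        let batch = vec![start, start + 1]; // pretend this came from a peer
        if sender.send(batch).await.is_err() {
            break; // receiver dropped: stop downloading
        }
    }
    download_complete.store(true, Ordering::Release);
}

// Non-blocking drain, analogous to process_pending_headers().
fn process_pending_headers(receiver: &mut mpsc::Receiver<Vec<BlockHeader>>) -> usize {
    let mut processed = 0;
    while let Ok(batch) = receiver.try_recv() {
        processed += batch.len();
    }
    processed
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel(100); // same buffer size the PR uses
    let done = Arc::new(AtomicBool::new(false));
    let task = tokio::spawn(download_headers_background(tx, done.clone()));

    // Stand-in for state download: do other work, draining headers as we go.
    let mut total = 0;
    while !done.load(Ordering::Acquire) || total < 10 {
        total += process_pending_headers(&mut rx);
        tokio::time::sleep(Duration::from_millis(10)).await;
    }
    task.await.unwrap();
    assert_eq!(total, 10);
}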
@pablodeymo pablodeymo requested a review from a team as a code owner January 28, 2026 21:36
@pablodeymo pablodeymo changed the title Parallelize header download with state download during snap sync perf(l1): parallelize header download with state download during snap sync Jan 28, 2026
@github-actions github-actions bot added L1 Ethereum client performance Block execution throughput and performance in general labels Jan 28, 2026

github-actions bot commented Jan 28, 2026

Lines of code report

Total lines added: 167
Total lines removed: 0
Total lines changed: 167

Detailed view
+------------------------------------------------+-------+------+
| File                                           | Lines | Diff |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/peer_handler.rs   | 547   | +1   |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/snap_sync.rs | 1110  | +165 |
+------------------------------------------------+-------+------+
| ethrex/crates/storage/store.rs                 | 2429  | +1   |
+------------------------------------------------+-------+------+


greptile-apps bot commented Jan 28, 2026

Greptile Summary

This PR successfully parallelizes header download with state download during snap sync by introducing a background task pattern with channel-based communication.

Key Changes:

  • Removed Clone derive from SnapBlockSyncState and added channel receiver and completion flag fields
  • Created download_headers_background() function that runs header download in a separate tokio task
  • Modified sync_cycle_snap() to spawn the background task immediately and proceed with state download without waiting
  • Added process_pending_headers() helper called at strategic points during state download to incrementally process headers
  • snap_sync() waits for initial headers in a loop before selecting pivot, then periodically processes incoming headers

Architecture:
The implementation uses proper async patterns with tokio channels (buffer=100) and Arc<AtomicBool> for synchronization. The background task sends header batches through the channel, while the main task non-blockingly receives them at key points.

Potential Issues:

  • One logic issue: potential error propagation problem in background task when fetching current header (line 179)
  • Minor inefficiency: busy-wait loop for initial headers uses 100ms sleeps instead of blocking receive
  • Suboptimal: when background task switches to full sync, main task continues snap sync unnecessarily
  • Channel buffer size (100) is hardcoded and may need tuning based on actual workloads

Confidence Score: 3.5/5

  • Safe to merge with minor issues - core parallelization logic is sound but has optimization opportunities
  • The refactoring successfully achieves the stated goal of parallelizing header and state downloads. The synchronization primitives (channels, atomic flags) are used correctly. However, there's one potential logic issue with error handling in the background task, and several areas where efficiency could be improved (busy-wait loops, unnecessary work when switching sync modes). The changes are well-structured but would benefit from the suggested improvements.
  • No files require special attention beyond the comments provided

Important Files Changed

Filename | Overview
crates/networking/p2p/sync/snap_sync.rs | Refactors header download to run in a background task parallel with state download, with proper channel-based synchronization

Sequence Diagram

sequenceDiagram
    participant Main as sync_cycle_snap (main task)
    participant BG as download_headers_background (spawned task)
    participant Channel as mpsc::channel
    participant State as snap_sync (state download)
    
    Main->>Main: Initialize block_sync_state
    Main->>Channel: Create channel (buffer=100)
    Main->>Main: Clone peers, store, blockchain
    Main->>BG: Spawn background task with clones
    activate BG
    Main->>State: Call snap_sync immediately
    activate State
    
    par Header Download
        loop Until sync_head found
            BG->>BG: Request block headers
            BG->>BG: Process & filter headers
            alt Headers available
                BG->>Channel: Send headers (non-blocking)
            end
            alt Sync head found or full sync needed
                BG->>BG: Set download_complete flag
                BG->>BG: Optionally switch to full sync
            end
        end
        BG->>BG: Set download_complete = true
        deactivate BG
    and State Download
        State->>State: Wait for initial headers (busy loop)
        State->>Channel: try_receive_headers
        State->>State: Select pivot header
        loop Account & Storage Download
            State->>State: request_account_range
            State->>State: insert_accounts
            State->>Channel: process_pending_headers (non-blocking)
            State->>State: request_storage_ranges
            State->>Channel: process_pending_headers (non-blocking)
            State->>State: insert_storages
        end
        loop Healing
            State->>State: heal_state_trie_wrap
            State->>State: heal_storage_trie
            State->>Channel: process_pending_headers (non-blocking)
        end
        State->>Channel: Final process_pending_headers
        State-->>Main: Return
        deactivate State
    end
    
    Main->>BG: Await background task completion
    Main->>Main: Sync snap_enabled state from Arc
    Main->>Main: Clear snap state & return


@greptile-apps greptile-apps bot left a comment


1 file reviewed, 6 comments


);

// Create channel for header communication between background task and main snap_sync
let (header_sender, header_receiver) = tokio::sync::mpsc::channel(100);
Channel buffer size of 100 could cause backpressure if header download significantly outpaces state download. Consider making buffer size configurable or calculating based on expected header batch sizes

(Path: crates/networking/p2p/sync/snap_sync.rs, line 285)

Comment on lines 89 to 93
    self.download_complete
        .as_ref()
        .map(|flag| flag.load(Ordering::Acquire))
        .unwrap_or(true)
}
Using Ordering::Acquire here is correct for reading the flag. However, consider the semantics: when download_complete is None, returning true means "download is complete", which could be misleading since no download was started. Consider documenting this behavior or returning a different value.

(Path: crates/networking/p2p/sync/snap_sync.rs, lines 89-93)

Comment on lines 373 to 383
while block_sync_state.block_hashes.is_empty() {
    if block_sync_state.is_download_complete() {
        return Err(SyncError::NoBlockHeaders);
    }
    // Process any available headers
    process_pending_headers(block_sync_state).await?;
    if block_sync_state.block_hashes.is_empty() {
        debug!("Waiting for initial headers from background task...");
        tokio::time::sleep(Duration::from_millis(100)).await;
    }
}
This busy-wait loop with 100ms sleeps could be inefficient. Consider using a blocking receive with timeout or a condvar-based approach for the initial header wait

(Path: crates/networking/p2p/sync/snap_sync.rs, lines 373-383)
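For illustration, the kind of change this suggests (the header type is a stand-in): wrapping recv() in tokio's timeout wakes the task the moment a batch arrives instead of polling on a fixed sleep.

use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::{error::Elapsed, timeout};

type BlockHeader = u64; // stand-in for the real header type

// Ok(Some(batch)): headers arrived. Ok(None): channel closed (sender done).
// Err(Elapsed): nothing yet; the caller re-checks the completion flag and loops.
async fn wait_for_headers(
    receiver: &mut mpsc::Receiver<Vec<BlockHeader>>,
) -> Result<Option<Vec<BlockHeader>>, Elapsed> {
    timeout(Duration::from_millis(100), receiver.recv()).await
}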


github-actions bot commented Feb 3, 2026

Kimi AI Code Review

This PR introduces a significant change to the synchronization process of the Ethereum client by parallelizing the header download with the state download during snap sync. Here are my detailed comments and suggestions:

Code Correctness and Potential Bugs

  1. Logging and Debugging: The addition of debug logs is a good practice for tracing the flow and understanding the behavior of the code. However, ensure that these logs do not introduce any performance overhead in production environments. [Lines 393, 402, 419, etc.]

  2. Error Handling: The changes in error handling for the block bodies request seem appropriate. However, it would be beneficial to also handle the case where the number of received block bodies does not match the number of requested hashes, potentially indicating a problem with the peer or network. [Lines 487-510]

  3. Peer Availability: The code now handles the case where no peer is available for the header or body request more gracefully. This is an improvement over previous error handling. [Lines 402, 528]

Security Vulnerabilities

  1. Peer Validation: The code should ensure that the peers from which headers and bodies are requested are validated and trusted to avoid potential attacks such as providing incorrect headers or bodies. The current implementation assumes that any peer available is suitable, which might be a vulnerability.

  2. Reentrancy and Concurrency: With the introduction of asynchronous operations and parallel processing, ensure that there are no reentrancy issues or race conditions, especially when dealing with shared state like snap_enabled.

Performance Implications

  1. Parallelization: The parallelization of header download and state download is a significant improvement and should lead to faster sync times. However, monitor the impact on resource usage, especially during the initial sync phase when both processes are running concurrently.

  2. Memory Usage: The introduction of channels and asynchronous communication might increase memory usage. Ensure that there are no memory leaks, especially when handling large volumes of headers and bodies.

Rust Best Practices and Idiomatic Patterns

  1. Error Handling: Rust's Result and Option types are used effectively for error handling. Consider using more descriptive error types or custom error enums for better clarity and maintainability.

  2. Async/Await: The use of async/await is appropriate for handling asynchronous operations. Ensure that all asynchronous code paths are properly awaited to avoid deadlocks or unhandled errors.

Memory Safety and Proper Error Handling

  1. Memory Safety: The code appears to be safe in terms of memory usage, but the introduction of parallel operations might introduce new risks. Ensure thorough testing, especially around memory usage under load.

  2. Error Propagation: Errors are propagated using Result and Option types. Ensure that all error paths are handled or logged appropriately to avoid silent failures.

Code Readability and Maintainability

  1. Documentation: While the code is well-commented with debug logs, consider adding more inline comments or documentation for complex logic or architectural decisions.

  2. Modularity: The separation of concerns between header download, state download, and block processing is well-maintained. This modularity aids in maintainability and future enhancements.

Ethereum-Specific Considerations

  1. Consensus Rules and EIP Compliance: Ensure that the parallelized sync process adheres to all relevant Ethereum consensus rules and EIPs. This is critical for maintaining the integrity of the blockchain.

  2. State Trie and Storage Operations: The changes do not directly affect state trie or storage operations, but the performance implications of these operations should be monitored, especially with the increased load from parallel operations.

Overall, the PR introduces a significant improvement to the sync process by parallelizing header and state downloads. However, it is crucial to thoroughly test the changes, especially for potential security vulnerabilities, performance implications, and adherence to Ethereum consensus rules.


Automated review by Kimi (Moonshot AI)

The verbose debug logging added for diagnosing block body download failures
was causing excessive memory usage, leading to the process being killed
with exit code 137 (OOM) after 29 minutes.

Removes all [HEADER_DEBUG], [BODY_DEBUG], [FULLSYNC_DEBUG], and
[SNAP_TO_FULL_DEBUG] log statements.

github-actions bot commented Feb 3, 2026

Kimi AI Code Review

Code Correctness and Potential Bugs

  1. Missing Error Handling in download_headers_background Function:

    • In the download_headers_background function, the request_block_headers method is called, which can return None. However, there's no check for None after the call. This could lead to unwrapping an Option that is None, causing a panic.
    • Suggested Change: Add a check for None after the request_block_headers call and handle it appropriately.
  2. Potential Panic in sync_cycle_snap Function:

    • The snap_sync function calls get_current_head and get_block_number without checking whether the result is Ok. If either call fails, the code panics when the Ok value is unwrapped.
    • Suggested Change: Add proper error handling for the results of get_current_head and get_block_number.

Security Vulnerabilities

  1. Unvalidated Channel Closure:
    • In the download_headers_background function, the channel closure is not checked before sending headers. If the channel is closed, this could lead to a panic.
    • Suggested Change: Check if the channel is closed before sending headers and handle it appropriately.

Performance Implications

  1. Parallelization of Header and State Download:
    • The PR introduces parallelization of header download with state download, which is a significant performance improvement. However, the implementation should ensure that the parallel tasks do not interfere with each other or cause race conditions.
    • Suggested Change: Review the synchronization mechanisms to ensure that the parallel tasks are properly synchronized.

Rust Best Practices and Idiomatic Patterns

  1. Use of Arc<AtomicBool> for Shared State:
    • The use of Arc<AtomicBool> for sharing state between tasks is appropriate and follows Rust's best practices for shared mutable state.
    • Suggested Change: None.

Memory Safety and Proper Error Handling

  1. Proper Error Handling:
    • The code should handle potential errors at every point where an Option or Result is returned. This is crucial for memory safety and avoiding panics.
    • Suggested Change: Add error handling for all Option and Result types.

Code Readability and Maintainability

  1. Complexity in sync_cycle_snap Function:
    • The sync_cycle_snap function is quite complex and handles multiple aspects of the synchronization process. This can make the code difficult to maintain and understand.
    • Suggested Change: Consider breaking down the sync_cycle_snap function into smaller, more focused functions to improve readability and maintainability.

Ethereum-Specific Considerations

  1. Consensus Rules and EIP Compliance:

    • The code should be reviewed to ensure that it complies with all relevant Ethereum consensus rules and EIPs.
    • Suggested Change: None, as the provided diff does not directly affect consensus rules or EIP compliance.
  2. RLP Encoding/Decoding Correctness:

    • The code does not directly involve RLP encoding/decoding, so no specific review is needed for this aspect.
    • Suggested Change: None.
  3. Transaction and Block Validation Logic:

    • The code should be reviewed to ensure that all transactions and blocks are properly validated according to Ethereum's rules.
    • Suggested Change: None, as the provided diff does not directly affect transaction and block validation logic.

Conclusion

The PR introduces a significant improvement by parallelizing header and state download during snap sync. However, it is crucial to ensure that the implementation is correct, secure, and maintainable. The suggestions provided aim to address potential issues and improve the overall quality of the code.


Automated review by Kimi (Moonshot AI)

@greptile-apps greptile-apps bot mentioned this pull request Feb 3, 2026
…in the multisync monitoring script (docker_monitor.py).

The sync completion logs already contain per-phase completion markers
(e.g. "✓ BLOCK HEADERS complete: 25,693,009 headers in 0:29:00")
but this data was not surfaced in the Slack messages or run summaries.

This adds a parse_phase_timings() function that reads saved container
logs and extracts timing, count, and duration for all 8 snap sync
phases: Block Headers, Account Ranges, Account Insertion, Storage
Ranges, Storage Insertion, State Healing, Storage Healing, and
Bytecodes. The breakdown is appended to both the Slack notification
(as a code block per network instance) and the text-based run log
(run_history.log and per-run summary.txt). When a phase did not
complete (e.g. on a failed run), it is simply omitted from the
breakdown.
@pablodeymo
Contributor Author

  Last 24h Runs (branch: feature/background-header-download, commit 441634d01)
  ┌─────┬──────────────┬─────────┬─────────┬─────────────┬────────────────────────┐
  │ Run │   Started    │ Network │ Result  │  Sync Time  │    Post-sync Block     │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #94 │ Feb 4, 04:30 │ mainnet │ SUCCESS │ 4h 37m      │ 24382518 (+111 blocks) │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #95 │ Feb 4, 09:36 │ mainnet │ SUCCESS │ 6h 31m      │ 24384610 (+106 blocks) │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #96 │ Feb 4, 16:35 │ mainnet │ SUCCESS │ 4h 40m      │ 24386126 (+110 blocks) │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #97 │ Feb 4, 21:43 │ mainnet │ SUCCESS │ 4h 49m      │ 24387704 (+109 blocks) │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #98 │ Feb 5, 03:00 │ mainnet │ SUCCESS │ 4h 43m      │ 24389248 (+110 blocks) │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #99 │ Feb 5, 08:12 │ mainnet │ FAILED  │ 8h+ timeout │ -                      │
  └─────┴──────────────┴─────────┴─────────┴─────────────┴────────────────────────┘

@pablodeymo
Contributor Author

Here's the timeline for Run #99:

Run #99 Timeline (PR #6059 - feature/background-header-download)
Time: 08:19
Event: Sync started (mainnet from block 0)
Elapsed: 0
────────────────────────────────────────
Time: 11:54
Event: Bytecodes download started
Elapsed: 3h 35m
────────────────────────────────────────
Time: 13:07
Event: Bytecodes download finished (last batch: 42,217)
Elapsed: 4h 48m
────────────────────────────────────────
Time: 13:07
Event: validate_bytecodes started
Elapsed: 4h 48m
────────────────────────────────────────
Time: 13:44
Event: First sync cycle finished (19,490s = 5h 25m total snap sync)
Elapsed: 5h 25m
────────────────────────────────────────
Time: 13:44-16:18
Event: Catch-up sync cycles (~4 min each, continuously)
Elapsed: 5h 25m - 8h+
────────────────────────────────────────
Time: ~16:18
Event: 8h timeout hit — monitor killed it
Elapsed: 8h 6m

Root Cause

The snap sync itself completed in 5h 25m, within the normal range (previous runs took 4h 26m to 6h 31m). The problem is what happened after:

  • After sync completed, the node entered catch-up mode doing full sync cycles every ~4
    minutes
  • It kept doing these cycles for 2.5+ hours without the monitor detecting it as "done"
  • Meanwhile, the background header download feature was still downloading headers in
    parallel (log ends showing headers being fetched from block ~24.3M downward)
  • The node never transitioned to the "processing new blocks" state that the monitor expects
    within 22 minutes

Comparison with Successful Runs

In successful runs #92-#98, after snap sync the node caught up and started processing new
blocks within ~22 minutes. In #99, it was stuck in catch-up sync cycles for 2.5h+. This
could be:

  1. The background header download competing for resources/bandwidth with the catch-up sync,
    slowing it down
  2. A larger gap to catch up (the sync took 5h25m vs typical 4h35m, meaning the chain
    advanced further during sync)
  3. Network conditions — many WARN about invalid/empty headers from peers throughout the
    tail of the log

proof conversion helpers (encodable_to_proof, proof_to_encodable) from
server.rs to mod.rs since they are shared utilities used by both client
and server.
peer selection, extract magic numbers into named constants
(ACCOUNT_RANGE_CHUNK_COUNT, STORAGE_BATCH_SIZE, HASH_MAX), remove unused
contents field from DumpError and use derive Debug, rename
missing_children to pending_children in healing code, and wrap
process_byte_codes_request in spawn_blocking for consistency with other
server handlers.
…functions

so they no longer depend on the PeerHandler type as a method receiver. The three
methods that used self (request_account_range, request_bytecodes,
request_storage_ranges) now take peers: &mut PeerHandler as a parameter, and the
two already-static methods (request_state_trienodes, request_storage_trienodes)
simply move out of the impl block. Callers in snap_sync.rs, healing/state.rs,
and healing/storage.rs are updated accordingly.
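Schematically, with the real argument lists elided (only the function and parameter names come from this commit message):

struct PeerHandler; // stand-in for the real type

// Before: a method on PeerHandler.
// impl PeerHandler {
//     pub async fn request_account_range(&mut self, /* range args */) { ... }
// }

// After: a free function in snap/client.rs that borrows the handler explicitly.
pub async fn request_account_range(peers: &mut PeerHandler /*, range args */) {
    let _ = peers; // callers in snap_sync.rs and healing/{state,storage}.rs pass it in
}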
…ation' into feature/background-header-download
Base automatically changed from refactor/snapsync-healing-unification to main February 6, 2026 21:52
Copilot AI review requested due to automatic review settings February 9, 2026 14:50
Contributor

Copilot AI left a comment


Pull request overview

This PR aims to improve L1 snap sync performance by downloading block headers concurrently with snap state download, so state sync can start immediately instead of waiting for all headers.

Changes:

  • Introduces a background header-download task and incremental header processing during snap_sync().
  • Adds snap protocol server tests for Hive AccountRange vectors.
  • Minor refactor in PeerHandler::request_block_bodies_inner and updates the changelog.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File | Description
crates/networking/p2p/sync/snap_sync.rs | Runs header download in a background task and incrementally processes headers during snap state sync.
crates/networking/p2p/tests/snap_server_tests.rs | Adds Hive-derived AccountRange tests and a local in-memory state fixture.
crates/networking/p2p/peer_handler.rs | Small refactor to store the request result before pattern matching.
CHANGELOG.md | Records the perf change entry.


Comment on lines 285 to 289
return super::full::sync_cycle_full(
-    peers,
+    &mut peers,
    blockchain,
    tokio_util::sync::CancellationToken::new(),
    sync_head,

Copilot AI Feb 9, 2026

download_headers_background() triggers full::sync_cycle_full() inside the background header task. That causes full sync to run concurrently with the ongoing snap_sync() in the main task (both touching the store and consuming peers), and it also contradicts the comment that the main task will handle the switch. Instead, the background task should only signal the mode switch (e.g., set the flag + mark download complete) and return, letting sync_cycle_snap() abort snap sync and then run full sync in a single place.

Comment on lines +226 to +230
// Handle reorg case where sync head is not reachable
let current_head = store
    .get_block_header(current_head_number)?
    .map(|h| h.hash())
    .unwrap_or(first_block_hash);

Copilot AI Feb 9, 2026

The reorg/stuck detection uses store.get_block_header(current_head_number) to derive current_head, but the background task is no longer persisting downloaded headers to the store. After the first batch, current_head_number will likely be beyond the local canonical head, so this lookup returns None and the check becomes ineffective/misleading. Track the last fetched head hash in a local variable (as the previous code did) or persist headers in the background task before using store-based checks.

Suggested change
-// Handle reorg case where sync head is not reachable
-let current_head = store
-    .get_block_header(current_head_number)?
-    .map(|h| h.hash())
-    .unwrap_or(first_block_hash);
+// Handle reorg case where sync head is not reachable.
+// Use the last fetched header as the current head instead of relying on the store,
+// since the background task does not persist headers into the store.
+let current_head = last_block_hash;

// Check if we should abort (full sync triggered)
if should_abort_snap_sync(snap_enabled) {
    info!("Snap sync aborted: switching to full sync");
    return Ok(());

Copilot AI Feb 9, 2026

When should_abort_snap_sync() is true, snap_sync() returns Ok(()), and sync_cycle_snap() continues to clear snap state and finish successfully. This makes the mode-switch path ambiguous: if the background task is changed to only signal (and not run full sync itself), the caller will silently skip full sync. Consider returning a distinct status/error from snap_sync() (e.g., "SwitchToFullSync") and have sync_cycle_snap() perform the full sync (and skip clear_snap_state() if full sync is about to run).

Suggested change
-return Ok(());
+// Signal the caller via an error so it does not treat this as a
+// successful completion and can perform the appropriate mode switch.
+return Err(SyncError::NoBlockHeaders);

// So I copied the state from a geth execution of the test suite

// State was trimmed to only the first 100 accounts (as the furthest account used by the tests is account 87)
// If the full 408 account state is needed check out previous commits the PR that added this code

Copilot AI Feb 9, 2026

Minor grammar nit in this comment: "check out previous commits the PR" is missing a preposition (e.g., "previous commits of the PR").

Suggested change
-// If the full 408 account state is needed check out previous commits the PR that added this code
+// If the full 408 account state is needed, check out previous commits of the PR that added this code

…nc full sync loop

After snap sync completes, the node switches to full sync mode. Full sync
downloads headers backward from the tip looking for a block that is_canonical_sync
recognizes. This check requires entries in CANONICAL_BLOCK_HASHES (number→hash).

The background header download task stored headers via add_block_headers, which
only wrote to HEADERS and BLOCK_NUMBERS but not CANONICAL_BLOCK_HASHES. This
caused full sync to never find a canonical ancestor, looping endlessly with
"Sync failed to find target block header" until the 8h timeout.

Fix: add CANONICAL_BLOCK_HASHES write to the existing transaction in
add_block_headers (which has exactly one caller: snap sync background download).
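A toy model of the invariant being fixed (the table names follow the commit message, but this mock store is not ethrex's API): unless the same write also records number → hash, a canonicity check keyed by block number can never succeed.

use std::collections::BTreeMap;

type H256 = [u8; 32]; // stand-in hash type

#[derive(Default)]
struct MockStore {
    headers: BTreeMap<H256, u64>,                // HEADERS (header body elided)
    block_numbers: BTreeMap<H256, u64>,          // BLOCK_NUMBERS
    canonical_block_hashes: BTreeMap<u64, H256>, // CANONICAL_BLOCK_HASHES
}

impl MockStore {
    fn add_block_headers(&mut self, headers: &[(u64, H256)]) {
        for &(number, hash) in headers {
            self.headers.insert(hash, number);
            self.block_numbers.insert(hash, number);
            // The fix: record number -> hash in the same write, so the
            // full-sync canonicity lookup below can find an ancestor.
            self.canonical_block_hashes.insert(number, hash);
        }
    }

    fn is_canonical(&self, number: u64, hash: H256) -> bool {
        self.canonical_block_hashes.get(&number) == Some(&hash)
    }
}

fn main() {
    let mut store = MockStore::default();
    store.add_block_headers(&[(100, [1u8; 32])]);
    assert!(store.is_canonical(100, [1u8; 32]));
}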
@pablodeymo
Contributor Author

Attaching the investigation of the bug:
BACKGROUND_HEADER_DOWNLOAD_INVESTIGATION.md

…ng bug

The block_hashes field stored hashes without block numbers, so the
numbers_and_hashes construction for forkchoice_update inferred block
numbers from position (pivot_header.number - i). When background header
download and update_pivot interleaved inserts, entries ended up out of
order, causing forkchoice_update to write wrong canonical hashes.

BTreeMap keyed by block number naturally handles ordering and deduplicates
overlapping ranges from update_pivot re-inserting the same block numbers.
}

snap_sync(peers, &store, &mut block_sync_state, datadir).await?;
download_complete.store(true, Ordering::Release);

Contributor

Bug: download_complete is only set to true in the happy path (here) and in two specific early-returns (max attempts, full-sync switch). But if the function returns via ? error propagation (e.g., peers.request_block_headers(...) fails, store.add_block_headers(...) fails, etc.), download_complete is never set to true.

When this happens, the sender half of the channel is dropped (task exits), but snap_sync()'s initial-headers wait loop (line ~435) checks download_status().is_terminal() which returns false (flag still false), so it keeps calling recv_headers_timeout() → gets None (closed channel) → loops with 100ms sleeps forever.

Suggestion: use an RAII guard pattern to ensure download_complete is always set on exit:

struct DownloadCompleteGuard(Arc<AtomicBool>);
impl Drop for DownloadCompleteGuard {
    fn drop(&mut self) {
        self.0.store(true, Ordering::Release);
    }
}
// At the start of download_headers_background:
let _guard = DownloadCompleteGuard(download_complete.clone());

Then remove the manual download_complete.store(true, ...) calls.

);
// We can't easily go back in the background task, so we just continue with the current head
// The update_pivot mechanism will handle this case
tokio::time::sleep(Duration::from_millis(100)).await;

Contributor

The original code handled the reorg case by walking backward: current_head = first_block_parent_hash. The background task can't do this because current_head was removed in favor of current_head_number, and going back requires knowing the parent hash.

But the current code just sleeps 100ms and retries with the same current_head_number, which will get the same headers back and loop indefinitely. The comment says "The update_pivot mechanism will handle this case" but update_pivot runs in snap_sync() on the main task — it doesn't feed back into the background task's current_head_number.

Consider either:

  1. Decrement current_head_number by 1 to walk backward (simpler, analogous to the old parent_hash approach)
  2. Add a maximum retry count for this specific case
  3. Use _first_block_parent_hash (currently prefixed with _) to look up the parent's block number

    .process_incoming_headers(block_headers_iter)
    .await?;
// Send headers through channel (skip the first as we already have it)
if block_headers.len() > 1 && header_sender.send(block_headers).await.is_err() {

Contributor

The first header is skipped by the receiver in process_pending_headers() (via .skip(1)), but here the full batch including the first header is sent. This means every batch sent through the channel carries one header that will be discarded. Not a bug, but it's a subtle contract — the sender and receiver must agree on the "skip first" convention.

More importantly: the original code only skipped the first header of the first batch (to avoid the already-known current head). Here, .skip(1) is applied to every batch received from the channel. If the background task sends multiple batches, the first header of each batch (which is the continuation point from the previous batch's last header) gets silently dropped. This could cause gaps in block_hashes — missing one header per batch.

Double-check that the header fetching always returns overlapping ranges (i.e., the first header of batch N+1 == the last header of batch N). If so, the skip is correct. If not, this introduces gaps.

block_sync_state.set_header_channel(header_receiver, download_complete.clone());

// Create Arc wrapper for snap_enabled so we can share it with the background task
let snap_enabled_arc = Arc::new(AtomicBool::new(snap_enabled.load(Ordering::Relaxed)));

Contributor

nit: Creating a separate Arc<AtomicBool> copy of snap_enabled means the Syncer's original snap_enabled (shared with SyncManager) doesn't see the full-sync switch until snap_sync() completes and the sync-back on line ~389 runs. If SyncManager or other code checks snap_enabled concurrently during the sync cycle, it'll still see true even though the background task decided to switch to full sync.

In practice this is probably fine since the sync cycle is the only writer, but it's worth a comment explaining why a copy is used instead of sharing the original Arc.

@github-project-automation github-project-automation bot moved this to In Progress in ethrex_l1 Feb 11, 2026

Labels

L1 Ethereum client performance Block execution throughput and performance in general snapsync

Projects

Status: In Progress
Status: Todo

Development

Successfully merging this pull request may close these issues.

2 participants