
Conversation

Contributor

@pablodeymo pablodeymo commented Jan 28, 2026

Summary

  • Moves header downloading to a background task that runs in parallel with state download
  • State download can now begin immediately instead of waiting for all headers to be downloaded
  • Adds incremental header processing at strategic points during snap_sync()
  • Fixes block_hashes ordering bug by replacing Vec<H256> with BTreeMap<u64, H256>

Motivation

Previously, sync_cycle_snap() downloaded ALL block headers before starting state download. This caused unnecessary delays since:

  1. State download is independent of header download
  2. Pivot selection only needs recent headers, not all of them
  3. For mainnet with millions of blocks, header download alone takes significant time

Additionally, the block_hashes: Vec<H256> field stored hashes without block numbers. The numbers_and_hashes construction for forkchoice_update inferred block numbers from position (pivot_header.number - i). When the background header download and update_pivot interleaved inserts, entries ended up out of order, causing forkchoice_update to write wrong canonical hashes.
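A minimal runnable sketch of the fix (H256 here is a stand-in alias, and the block numbers are invented for illustration): keying by block number makes ordering independent of insertion order, and re-inserting an overlapping range is a harmless overwrite.

use std::collections::BTreeMap;

type H256 = [u8; 32]; // stand-in for the real hash type

fn main() {
    let mut block_hashes: BTreeMap<u64, H256> = BTreeMap::new();

    // Background header download and update_pivot may interleave inserts
    // out of order and with overlapping ranges; the map keeps exactly one
    // entry per block number, in ascending key order.
    block_hashes.insert(102, [2u8; 32]);
    block_hashes.insert(100, [0u8; 32]);
    block_hashes.insert(101, [1u8; 32]);
    block_hashes.insert(102, [2u8; 32]); // overlapping re-insert: no duplicate

    // numbers_and_hashes is built from the keys, never inferred from position.
    let numbers_and_hashes: Vec<(u64, H256)> = block_hashes.into_iter().collect();
    let numbers: Vec<u64> = numbers_and_hashes.iter().map(|(n, _)| *n).collect();
    assert_eq!(numbers, vec![100, 101, 102]);
}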

Implementation

  • Add header_receiver and download_complete fields to SnapBlockSyncState
  • Create download_headers_background() function that runs header download in a separate tokio task
  • Modify sync_cycle_snap() to spawn the background task and start state download immediately
  • Add process_pending_headers() helper to incrementally process headers at strategic points
  • snap_sync() waits for initial headers only when needed for pivot selection
  • Replace block_hashes: Vec<H256> with BTreeMap<u64, H256> keyed by block number, which naturally handles ordering and deduplicates the overlapping ranges that update_pivot re-inserts
  • Build numbers_and_hashes directly from BTreeMap keys instead of inferring from position

Test plan

  • Run snap sync on testnet and verify parallel execution via logs
  • Verify state download starts before header download completes
  • Check memory usage to ensure channel doesn't buffer too many headers
  • Compare sync time before/after the change
  • Verify forkchoice_update writes correct canonical hashes after snap sync

…tory

Reorganize state_healing.rs and storage_healing.rs into a shared
sync/healing/ module structure with clearer naming conventions:

- Create sync/healing/ directory with mod.rs, types.rs, state.rs, storage.rs
- Rename MembatchEntryValue to HealingQueueEntry
- Rename MembatchEntry to StorageHealingQueueEntry
- Rename Membatch type to StorageHealingQueue
- Rename children_not_in_storage_count to missing_children_count
- Rename membatch variables to healing_queue throughout
- Extract shared HealingQueueEntry and StateHealingQueue types to types.rs (see the sketch after this list)
- Update sync.rs imports to use new healing module
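For orientation, a hedged sketch of what the renamed types might look like in types.rs (field names, field types, and container shapes are assumptions, not the actual ethrex definitions; missing_children_count is shown as usize, matching a later commit in this PR):

struct Nibbles(Vec<u8>); // stand-in for the trie path type

// Was MembatchEntryValue; shared by state and storage healing.
struct HealingQueueEntry {
    node_rlp: Vec<u8>,             // trie node waiting on missing children
    parent_path: Nibbles,
    missing_children_count: usize, // was children_not_in_storage_count
}

// Was MembatchEntry (shape assumed).
struct StorageHealingQueueEntry {
    account_path: Nibbles,
    entry: HealingQueueEntry,
}

// Was Membatch.
type StorageHealingQueue = Vec<StorageHealingQueueEntry>;
// The state-side queue over the shared entry type (shape assumed).
type StateHealingQueue = Vec<HealingQueueEntry>;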
Reorganize snap protocol code for better maintainability:

- Split rlpx/snap.rs into rlpx/snap/ directory:
  - codec.rs: RLP encoding/decoding for snap messages
  - messages.rs: Snap protocol message types
  - mod.rs: Module re-exports

- Split snap.rs into snap/ directory:
  - constants.rs: Snap sync constants and configuration
  - server.rs: Snap protocol server implementation
  - mod.rs: Module re-exports

- Move snap server tests to dedicated tests/ directory
- Update imports in p2p.rs, peer_handler.rs, and code_collector.rs
Document the phased approach for reorganizing snap sync code:
- Phase 1: rlpx/snap module split
- Phase 2: snap module split with server extraction
- Phase 3: healing module unification
Split the large sync.rs (1631 lines) into focused modules:

- sync/full.rs (~260 lines): Full sync implementation
  - sync_cycle_full(), add_blocks_in_batch(), add_blocks()

- sync/snap_sync.rs (~1100 lines): Snap sync implementation
  - sync_cycle_snap(), snap_sync(), SnapBlockSyncState
  - store_block_bodies(), update_pivot(), block_is_stale()
  - validate_state_root(), validate_storage_root(), validate_bytecodes()
  - insert_accounts(), insert_storages() (both rocksdb and non-rocksdb)

- sync.rs (~285 lines): Orchestration layer
  - Syncer struct with start_sync() and sync_cycle()
  - SyncMode, SyncError, AccountStorageRoots types
  - Re-exports for public API
…p/client.rs

Move all snap protocol client-side request methods from peer_handler.rs
to a dedicated snap/client.rs module:
- request_account_range and request_account_range_worker
- request_bytecodes
- request_storage_ranges and request_storage_ranges_worker
- request_state_trienodes
- request_storage_trienodes

Also moves related types: DumpError, RequestMetadata, SnapClientError,
RequestStateTrieNodesError, RequestStorageTrieNodes.

This reduces peer_handler.rs from 2,060 to 670 lines (~68% reduction),
leaving it focused on ETH protocol methods (block headers/bodies).

Added SnapClientError variant to SyncError for proper error handling.
Updated plan_snap_sync.md to mark Phase 4 as complete.
…napError type

Implement Phase 5 of snap sync refactoring plan - Error Handling.

- Create snap/error.rs with unified SnapError enum covering all snap protocol errors
- Update server functions (process_account_range_request, process_storage_ranges_request,
  process_byte_codes_request, process_trie_nodes_request) to return Result<T, SnapError>
- Remove SnapClientError and RequestStateTrieNodesError, consolidate into SnapError
- Keep RequestStorageTrieNodesError struct for request ID tracking in storage healing
- Add From<SnapError> for PeerConnectionError to support error propagation in message handlers
- Update sync module to use SyncError::Snap variant
- Update healing modules (state.rs, storage.rs) to use new error types
- Move DumpError struct to error.rs module
- Update test return types to use SnapError
- Mark Phase 5 as completed in plan document

All phases of the snap sync refactoring are now complete.
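As a rough illustration of the Phase 5 shape (the variant names below are invented; only SnapError itself, the From<SnapError> for PeerConnectionError impl, and the SyncError::Snap variant come from the commit message):

// Hypothetical sketch; the real variants live in snap/error.rs.
#[derive(Debug)]
enum SnapError {
    NoPeers,
    InvalidProof,
    Decode(String),
    Store(String),
}

#[derive(Debug)]
struct PeerConnectionError(String); // stand-in for the real type

// Lets message handlers use `?` on snap results.
impl From<SnapError> for PeerConnectionError {
    fn from(err: SnapError) -> Self {
        PeerConnectionError(format!("snap protocol error: {err:?}"))
    }
}

#[derive(Debug)]
enum SyncError {
    Snap(SnapError), // variant used by the sync module
}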
Change missing_children_count from u64 to usize in HealingQueueEntry
and node_missing_children function to match StorageHealingQueueEntry
and be consistent with memory structure counting conventions.
Resolve conflicts from #5977 and #6018 merge to main:
- Keep modular sync structure (sync.rs delegates to full.rs and snap_sync.rs)
- Keep snap client code in snap/client.rs (removed from peer_handler.rs)
- Add InsertingAccountRanges metric from #6018 to snap_sync.rs
- Remove unused info import from peer_handler.rs
Previously, sync_cycle_snap() downloaded ALL block headers before starting
state download. This caused unnecessary delays since state download is
independent of header download and the pivot selection only needs recent
headers.

This change moves header downloading to a background task that runs in
parallel with state download:

- Add header_receiver and download_complete fields to SnapBlockSyncState
- Create download_headers_background() function that runs header download
  in a separate tokio task and sends headers through an mpsc channel
- Modify sync_cycle_snap() to spawn the background task and start state
  download immediately
- Add process_pending_headers() helper to incrementally process headers
  at strategic points during snap_sync()
- snap_sync() waits for initial headers only when needed for pivot selection

This allows state download to begin much sooner, potentially reducing
overall sync time significantly for mainnet syncs.
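The pattern in miniature, as a self-contained sketch (the header type and batch contents are stand-ins; the real functions live in snap_sync.rs): a spawned task streams batches through a bounded mpsc channel and flips an AtomicBool when finished, while the consumer drains the channel without blocking at strategic points.

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::mpsc;

type BlockHeader = u64; // stand-in for the real header type

async fn download_headers_background(
    sender: mpsc::Sender<Vec<BlockHeader>>,
    download_complete: Arc<AtomicBool>,
) {
    for start in (0..10u64).step_by(2) {
        let batch = vec![start, start + 1]; // pretend this came from a peer
        if sender.send(batch).await.is_err() {
            break; // receiver dropped: stop downloading
        }
    }
    download_complete.store(true, Ordering::Release);
}

// Non-blocking drain, analogous to process_pending_headers().
fn process_pending_headers(receiver: &mut mpsc::Receiver<Vec<BlockHeader>>) -> usize {
    let mut processed = 0;
    while let Ok(batch) = receiver.try_recv() {
        processed += batch.len();
    }
    processed
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel(100); // same buffer size the PR uses
    let done = Arc::new(AtomicBool::new(false));
    let task = tokio::spawn(download_headers_background(tx, done.clone()));

    // Stand-in for state download: do other work, draining headers as we go.
    let mut total = 0;
    while !done.load(Ordering::Acquire) || total < 10 {
        total += process_pending_headers(&mut rx);
        tokio::time::sleep(Duration::from_millis(10)).await;
    }
    task.await.unwrap();
    assert_eq!(total, 10);
}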
@pablodeymo pablodeymo requested a review from a team as a code owner January 28, 2026 21:36
@pablodeymo pablodeymo changed the title Parallelize header download with state download during snap sync perf(l1): parallelize header download with state download during snap sync Jan 28, 2026
@github-actions github-actions bot added L1 Ethereum client performance Block execution throughput and performance in general labels Jan 28, 2026

github-actions bot commented Jan 28, 2026

Lines of code report

Total lines added: 167
Total lines removed: 0
Total lines changed: 167

Detailed view
+------------------------------------------------+-------+------+
| File                                           | Lines | Diff |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/peer_handler.rs   | 547   | +1   |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/snap_sync.rs | 1110  | +165 |
+------------------------------------------------+-------+------+
| ethrex/crates/storage/store.rs                 | 2429  | +1   |
+------------------------------------------------+-------+------+


greptile-apps bot commented Jan 28, 2026

Greptile Summary

This PR successfully parallelizes header download with state download during snap sync by introducing a background task pattern with channel-based communication.

Key Changes:

  • Removed Clone derive from SnapBlockSyncState and added channel receiver and completion flag fields
  • Created download_headers_background() function that runs header download in a separate tokio task
  • Modified sync_cycle_snap() to spawn the background task immediately and proceed with state download without waiting
  • Added process_pending_headers() helper called at strategic points during state download to incrementally process headers
  • snap_sync() waits for initial headers in a loop before selecting pivot, then periodically processes incoming headers

Architecture:
The implementation uses proper async patterns with tokio channels (buffer=100) and Arc<AtomicBool> for synchronization. The background task sends header batches through the channel, while the main task non-blockingly receives them at key points.

Potential Issues:

  • One logic issue: potential error propagation problem in background task when fetching current header (line 179)
  • Minor inefficiency: busy-wait loop for initial headers uses 100ms sleeps instead of blocking receive
  • Suboptimal: when background task switches to full sync, main task continues snap sync unnecessarily
  • Channel buffer size (100) is hardcoded and may need tuning based on actual workloads

Confidence Score: 3.5/5

  • Safe to merge with minor issues - core parallelization logic is sound but has optimization opportunities
  • The refactoring successfully achieves the stated goal of parallelizing header and state downloads. The synchronization primitives (channels, atomic flags) are used correctly. However, there's one potential logic issue with error handling in the background task, and several areas where efficiency could be improved (busy-wait loops, unnecessary work when switching sync modes). The changes are well-structured but would benefit from the suggested improvements.
  • No files require special attention beyond the comments provided

Important Files Changed

Filename | Overview
crates/networking/p2p/sync/snap_sync.rs | Refactors header download to run in a background task parallel with state download, with proper channel-based synchronization

Sequence Diagram

sequenceDiagram
    participant Main as sync_cycle_snap (main task)
    participant BG as download_headers_background (spawned task)
    participant Channel as mpsc::channel
    participant State as snap_sync (state download)
    
    Main->>Main: Initialize block_sync_state
    Main->>Channel: Create channel (buffer=100)
    Main->>Main: Clone peers, store, blockchain
    Main->>BG: Spawn background task with clones
    activate BG
    Main->>State: Call snap_sync immediately
    activate State
    
    par Header Download
        loop Until sync_head found
            BG->>BG: Request block headers
            BG->>BG: Process & filter headers
            alt Headers available
                BG->>Channel: Send headers (non-blocking)
            end
            alt Sync head found or full sync needed
                BG->>BG: Set download_complete flag
                BG->>BG: Optionally switch to full sync
            end
        end
        BG->>BG: Set download_complete = true
        deactivate BG
    and State Download
        State->>State: Wait for initial headers (busy loop)
        State->>Channel: try_receive_headers
        State->>State: Select pivot header
        loop Account & Storage Download
            State->>State: request_account_range
            State->>State: insert_accounts
            State->>Channel: process_pending_headers (non-blocking)
            State->>State: request_storage_ranges
            State->>Channel: process_pending_headers (non-blocking)
            State->>State: insert_storages
        end
        loop Healing
            State->>State: heal_state_trie_wrap
            State->>State: heal_storage_trie
            State->>Channel: process_pending_headers (non-blocking)
        end
        State->>Channel: Final process_pending_headers
        State-->>Main: Return
        deactivate State
    end
    
    Main->>BG: Await background task completion
    Main->>Main: Sync snap_enabled state from Arc
    Main->>Main: Clear snap state & return


@greptile-apps greptile-apps bot left a comment


1 file reviewed, 6 comments


);

// Create channel for header communication between background task and main snap_sync
let (header_sender, header_receiver) = tokio::sync::mpsc::channel(100);
Channel buffer size of 100 could cause backpressure if header download significantly outpaces state download. Consider making buffer size configurable or calculating based on expected header batch sizes

(Path: crates/networking/p2p/sync/snap_sync.rs, line 285)

Comment on lines 89 to 93
    self.download_complete
        .as_ref()
        .map(|flag| flag.load(Ordering::Acquire))
        .unwrap_or(true)
}
Using Ordering::Acquire here is correct for reading the flag. However, consider the semantics: when download_complete is None, returning true means "download is complete", which could be misleading since no download was started. Consider documenting this behavior or returning a different value.

(Path: crates/networking/p2p/sync/snap_sync.rs, lines 89-93)

Comment on lines 373 to 383
while block_sync_state.block_hashes.is_empty() {
    if block_sync_state.is_download_complete() {
        return Err(SyncError::NoBlockHeaders);
    }
    // Process any available headers
    process_pending_headers(block_sync_state).await?;
    if block_sync_state.block_hashes.is_empty() {
        debug!("Waiting for initial headers from background task...");
        tokio::time::sleep(Duration::from_millis(100)).await;
    }
}
This busy-wait loop with 100ms sleeps could be inefficient. Consider using a blocking receive with timeout or a condvar-based approach for the initial header wait

(Path: crates/networking/p2p/sync/snap_sync.rs, lines 373-383)
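For illustration, the kind of change this suggests (the header type is a stand-in): wrapping recv() in tokio's timeout wakes the task the moment a batch arrives instead of polling on a fixed sleep.

use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::{error::Elapsed, timeout};

type BlockHeader = u64; // stand-in for the real header type

// Ok(Some(batch)): headers arrived. Ok(None): channel closed (sender done).
// Err(Elapsed): nothing yet; the caller re-checks the completion flag and loops.
async fn wait_for_headers(
    receiver: &mut mpsc::Receiver<Vec<BlockHeader>>,
) -> Result<Option<Vec<BlockHeader>>, Elapsed> {
    timeout(Duration::from_millis(100), receiver.recv()).await
}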


github-actions bot commented Feb 3, 2026

Kimi AI Code Review

This PR introduces a significant change to the synchronization process of the Ethereum client by parallelizing the header download with the state download during snap sync. Here are my detailed comments and suggestions:

Code Correctness and Potential Bugs

  1. Logging and Debugging: The addition of debug logs is a good practice for tracing the flow and understanding the behavior of the code. However, ensure that these logs do not introduce any performance overhead in production environments. [Lines 393, 402, 419, etc.]

  2. Error Handling: The changes in error handling for the block bodies request seem appropriate. However, it would be beneficial to also handle the case where the number of received block bodies does not match the number of requested hashes, potentially indicating a problem with the peer or network. [Lines 487-510]

  3. Peer Availability: The code now handles the case where no peer is available for the header or body request more gracefully. This is an improvement over previous error handling. [Lines 402, 528]

Security Vulnerabilities

  1. Peer Validation: The code should ensure that the peers from which headers and bodies are requested are validated and trusted to avoid potential attacks such as providing incorrect headers or bodies. The current implementation assumes that any peer available is suitable, which might be a vulnerability.

  2. Reentrancy and Concurrency: With the introduction of asynchronous operations and parallel processing, ensure that there are no reentrancy issues or race conditions, especially when dealing with shared state like snap_enabled.

Performance Implications

  1. Parallelization: The parallelization of header download and state download is a significant improvement and should lead to faster sync times. However, monitor the impact on resource usage, especially during the initial sync phase when both processes are running concurrently.

  2. Memory Usage: The introduction of channels and asynchronous communication might increase memory usage. Ensure that there are no memory leaks, especially when handling large volumes of headers and bodies.

Rust Best Practices and Idiomatic Patterns

  1. Error Handling: Rust's Result and Option types are used effectively for error handling. Consider using more descriptive error types or custom error enums for better clarity and maintainability.

  2. Async/Await: The use of async/await is appropriate for handling asynchronous operations. Ensure that all asynchronous code paths are properly awaited to avoid deadlocks or unhandled errors.

Memory Safety and Proper Error Handling

  1. Memory Safety: The code appears to be safe in terms of memory usage, but the introduction of parallel operations might introduce new risks. Ensure thorough testing, especially around memory usage under load.

  2. Error Propagation: Errors are propagated using Result and Option types. Ensure that all error paths are handled or logged appropriately to avoid silent failures.

Code Readability and Maintainability

  1. Documentation: While the code is well-commented with debug logs, consider adding more inline comments or documentation for complex logic or architectural decisions.

  2. Modularity: The separation of concerns between header download, state download, and block processing is well-maintained. This modularity aids in maintainability and future enhancements.

Ethereum-Specific Considerations

  1. Consensus Rules and EIP Compliance: Ensure that the parallelized sync process adheres to all relevant Ethereum consensus rules and EIPs. This is critical for maintaining the integrity of the blockchain.

  2. State Trie and Storage Operations: The changes do not directly affect state trie or storage operations, but the performance implications of these operations should be monitored, especially with the increased load from parallel operations.

Overall, the PR introduces a significant improvement to the sync process by parallelizing header and state downloads. However, it is crucial to thoroughly test the changes, especially for potential security vulnerabilities, performance implications, and adherence to Ethereum consensus rules.


Automated review by Kimi (Moonshot AI)

The verbose debug logging added for diagnosing block body download failures
was causing excessive memory usage, leading to the process being killed
with exit code 137 (OOM) after 29 minutes.

Removes all [HEADER_DEBUG], [BODY_DEBUG], [FULLSYNC_DEBUG], and
[SNAP_TO_FULL_DEBUG] log statements.

github-actions bot commented Feb 3, 2026

Kimi AI Code Review

Code Correctness and Potential Bugs

  1. Missing Error Handling in download_headers_background Function:

    • In the download_headers_background function, the request_block_headers method is called, which can return None. However, there's no check for None after the call. This could lead to unwrapping an Option that is None, causing a panic.
    • Suggested Change: Add a check for None after the request_block_headers call and handle it appropriately.
  2. Potential Panic in sync_cycle_snap Function:

    • The snap_sync function calls get_current_head and get_block_number without checking whether the result is Ok. If either call fails, the code panics when the Ok value is unwrapped.
    • Suggested Change: Add proper error handling for the results of get_current_head and get_block_number.

Security Vulnerabilities

  1. Unvalidated Channel Closure:
    • In the download_headers_background function, the channel closure is not checked before sending headers. If the channel is closed, this could lead to a panic.
    • Suggested Change: Check if the channel is closed before sending headers and handle it appropriately.

Performance Implications

  1. Parallelization of Header and State Download:
    • The PR introduces parallelization of header download with state download, which is a significant performance improvement. However, the implementation should ensure that the parallel tasks do not interfere with each other or cause race conditions.
    • Suggested Change: Review the synchronization mechanisms to ensure that the parallel tasks are properly synchronized.

Rust Best Practices and Idiomatic Patterns

  1. Use of Arc<AtomicBool> for Shared State:
    • The use of Arc<AtomicBool> for sharing state between tasks is appropriate and follows Rust's best practices for shared mutable state.
    • Suggested Change: None.

Memory Safety and Proper Error Handling

  1. Proper Error Handling:
    • The code should handle potential errors at every point where an Option or Result is returned. This is crucial for memory safety and avoiding panics.
    • Suggested Change: Add error handling for all Option and Result types.

Code Readability and Maintainability

  1. Complexity in sync_cycle_snap Function:
    • The sync_cycle_snap function is quite complex and handles multiple aspects of the synchronization process. This can make the code difficult to maintain and understand.
    • Suggested Change: Consider breaking down the sync_cycle_snap function into smaller, more focused functions to improve readability and maintainability.

Ethereum-Specific Considerations

  1. Consensus Rules and EIP Compliance:

    • The code should be reviewed to ensure that it complies with all relevant Ethereum consensus rules and EIPs.
    • Suggested Change: None, as the provided diff does not directly affect consensus rules or EIP compliance.
  2. RLP Encoding/Decoding Correctness:

    • The code does not directly involve RLP encoding/decoding, so no specific review is needed for this aspect.
    • Suggested Change: None.
  3. Transaction and Block Validation Logic:

    • The code should be reviewed to ensure that all transactions and blocks are properly validated according to Ethereum's rules.
    • Suggested Change: None, as the provided diff does not directly affect transaction and block validation logic.

Conclusion

The PR introduces a significant improvement by parallelizing header and state download during snap sync. However, it is crucial to ensure that the implementation is correct, secure, and maintainable. The suggestions provided aim to address potential issues and improve the overall quality of the code.


Automated review by Kimi (Moonshot AI)

@greptile-apps greptile-apps bot mentioned this pull request Feb 3, 2026
…in the multisync monitoring script (docker_monitor.py).

The sync completion logs already contain per-phase completion markers
(e.g. "✓ BLOCK HEADERS complete: 25,693,009 headers in 0:29:00")
but this data was not surfaced in the Slack messages or run summaries.

This adds a parse_phase_timings() function that reads saved container
logs and extracts timing, count, and duration for all 8 snap sync
phases: Block Headers, Account Ranges, Account Insertion, Storage
Ranges, Storage Insertion, State Healing, Storage Healing, and
Bytecodes. The breakdown is appended to both the Slack notification
(as a code block per network instance) and the text-based run log
(run_history.log and per-run summary.txt). When a phase did not
complete (e.g. on a failed run), it is simply omitted from the
breakdown.
@pablodeymo
Contributor Author

  Last 24h Runs (branch: feature/background-header-download, commit 441634d01)
  ┌─────┬──────────────┬─────────┬─────────┬─────────────┬────────────────────────┐
  │ Run │   Started    │ Network │ Result  │  Sync Time  │    Post-sync Block     │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #94 │ Feb 4, 04:30 │ mainnet │ SUCCESS │ 4h 37m      │ 24382518 (+111 blocks) │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #95 │ Feb 4, 09:36 │ mainnet │ SUCCESS │ 6h 31m      │ 24384610 (+106 blocks) │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #96 │ Feb 4, 16:35 │ mainnet │ SUCCESS │ 4h 40m      │ 24386126 (+110 blocks) │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #97 │ Feb 4, 21:43 │ mainnet │ SUCCESS │ 4h 49m      │ 24387704 (+109 blocks) │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #98 │ Feb 5, 03:00 │ mainnet │ SUCCESS │ 4h 43m      │ 24389248 (+110 blocks) │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #99 │ Feb 5, 08:12 │ mainnet │ FAILED  │ 8h+ timeout │ -                      │
  └─────┴──────────────┴─────────┴─────────┴─────────────┴────────────────────────┘

@pablodeymo
Contributor Author

Here's the timeline for Run #99:

Run #99 Timeline (PR #6059 - feature/background-header-download)
Time: 08:19
Event: Sync started (mainnet from block 0)
Elapsed: 0
────────────────────────────────────────
Time: 11:54
Event: Bytecodes download started
Elapsed: 3h 35m
────────────────────────────────────────
Time: 13:07
Event: Bytecodes download finished (last batch: 42,217)
Elapsed: 4h 48m
────────────────────────────────────────
Time: 13:07
Event: validate_bytecodes started
Elapsed: 4h 48m
────────────────────────────────────────
Time: 13:44
Event: First sync cycle finished (19,490s = 5h 25m total snap sync)
Elapsed: 5h 25m
────────────────────────────────────────
Time: 13:44-16:18
Event: Catch-up sync cycles (~4 min each, continuously)
Elapsed: 5h 25m - 8h+
────────────────────────────────────────
Time: ~16:18
Event: 8h timeout hit — monitor killed it
Elapsed: 8h 6m

Root Cause

The snap sync itself completed in 5h 25m, within the normal range (previous runs took 4h 26m to 6h 31m). The problem is what happened after:

  • After sync completed, the node entered catch-up mode doing full sync cycles every ~4
    minutes
  • It kept doing these cycles for 2.5+ hours without the monitor detecting it as "done"
  • Meanwhile, the background header download feature was still downloading headers in
    parallel (log ends showing headers being fetched from block ~24.3M downward)
  • The node never transitioned to the "processing new blocks" state that the monitor expects
    within 22 minutes

Comparison with Successful Runs

In successful runs #92-#98, after snap sync the node caught up and started processing new
blocks within ~22 minutes. In #99, it was stuck in catch-up sync cycles for 2.5h+. This
could be:

  1. The background header download competing for resources/bandwidth with the catch-up sync,
    slowing it down
  2. A larger gap to catch up (the sync took 5h25m vs typical 4h35m, meaning the chain
    advanced further during sync)
  3. Network conditions — many WARN about invalid/empty headers from peers throughout the
    tail of the log

proof conversion helpers (encodable_to_proof, proof_to_encodable) from
server.rs to mod.rs since they are shared utilities used by both client
and server.
peer selection, extract magic numbers into named constants
(ACCOUNT_RANGE_CHUNK_COUNT, STORAGE_BATCH_SIZE, HASH_MAX), remove unused
contents field from DumpError and use derive Debug, rename
missing_children to pending_children in healing code, and wrap
process_byte_codes_request in spawn_blocking for consistency with other
server handlers.
…functions

so they no longer depend on the PeerHandler type as a method receiver. The three
methods that used self (request_account_range, request_bytecodes,
request_storage_ranges) now take peers: &mut PeerHandler as a parameter, and the
two already-static methods (request_state_trienodes, request_storage_trienodes)
simply move out of the impl block. Callers in snap_sync.rs, healing/state.rs,
and healing/storage.rs are updated accordingly.
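Schematically, with the real argument lists elided (only the function and parameter names come from this commit message):

struct PeerHandler; // stand-in for the real type

// Before: a method on PeerHandler.
// impl PeerHandler {
//     pub async fn request_account_range(&mut self, /* range args */) { ... }
// }

// After: a free function in snap/client.rs that borrows the handler explicitly.
pub async fn request_account_range(peers: &mut PeerHandler /*, range args */) {
    let _ = peers; // callers in snap_sync.rs and healing/{state,storage}.rs pass it in
}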
…ation' into feature/background-header-download
Base automatically changed from refactor/snapsync-healing-unification to main February 6, 2026 21:52
Copilot AI review requested due to automatic review settings February 9, 2026 14:50
Contributor

Copilot AI left a comment


Pull request overview

This PR aims to improve L1 snap sync performance by downloading block headers concurrently with snap state download, so state sync can start immediately instead of waiting for all headers.

Changes:

  • Introduces a background header-download task and incremental header processing during snap_sync().
  • Adds snap protocol server tests for Hive AccountRange vectors.
  • Minor refactor in PeerHandler::request_block_bodies_inner and updates the changelog.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File | Description
crates/networking/p2p/sync/snap_sync.rs | Runs header download in a background task and incrementally processes headers during snap state sync.
crates/networking/p2p/tests/snap_server_tests.rs | Adds Hive-derived AccountRange tests and a local in-memory state fixture.
crates/networking/p2p/peer_handler.rs | Small refactor to store the request result before pattern matching.
CHANGELOG.md | Records the perf change entry.


Comment on lines 285 to 289
return super::full::sync_cycle_full(
-    peers,
+    &mut peers,
    blockchain,
    tokio_util::sync::CancellationToken::new(),
    sync_head,

Copilot AI Feb 9, 2026

download_headers_background() triggers full::sync_cycle_full() inside the background header task. That causes full sync to run concurrently with the ongoing snap_sync() in the main task (both touching the store and consuming peers), and it also contradicts the comment that the main task will handle the switch. Instead, the background task should only signal the mode switch (e.g., set the flag + mark download complete) and return, letting sync_cycle_snap() abort snap sync and then run full sync in a single place.

Comment on lines +226 to +230
// Handle reorg case where sync head is not reachable
let current_head = store
    .get_block_header(current_head_number)?
    .map(|h| h.hash())
    .unwrap_or(first_block_hash);

Copilot AI Feb 9, 2026

The reorg/stuck detection uses store.get_block_header(current_head_number) to derive current_head, but the background task is no longer persisting downloaded headers to the store. After the first batch, current_head_number will likely be beyond the local canonical head, so this lookup returns None and the check becomes ineffective/misleading. Track the last fetched head hash in a local variable (as the previous code did) or persist headers in the background task before using store-based checks.

Suggested change
-// Handle reorg case where sync head is not reachable
-let current_head = store
-    .get_block_header(current_head_number)?
-    .map(|h| h.hash())
-    .unwrap_or(first_block_hash);
+// Handle reorg case where sync head is not reachable.
+// Use the last fetched header as the current head instead of relying on the store,
+// since the background task does not persist headers into the store.
+let current_head = last_block_hash;

// Check if we should abort (full sync triggered)
if should_abort_snap_sync(snap_enabled) {
    info!("Snap sync aborted: switching to full sync");
    return Ok(());

Copilot AI Feb 9, 2026

When should_abort_snap_sync() is true, snap_sync() returns Ok(()), and sync_cycle_snap() continues to clear snap state and finish successfully. This makes the mode-switch path ambiguous: if the background task is changed to only signal (and not run full sync itself), the caller will silently skip full sync. Consider returning a distinct status/error from snap_sync() (e.g., "SwitchToFullSync") and have sync_cycle_snap() perform the full sync (and skip clear_snap_state() if full sync is about to run).

Suggested change
-return Ok(());
+// Signal the caller via an error so it does not treat this as a
+// successful completion and can perform the appropriate mode switch.
+return Err(SyncError::NoBlockHeaders);

// So I copied the state from a geth execution of the test suite

// State was trimmed to only the first 100 accounts (as the furthest account used by the tests is account 87)
// If the full 408 account state is needed check out previous commits the PR that added this code

Copilot AI Feb 9, 2026

Minor grammar nit in this comment: "check out previous commits the PR" is missing a preposition (e.g., "previous commits of the PR").

Suggested change
-// If the full 408 account state is needed check out previous commits the PR that added this code
+// If the full 408 account state is needed, check out previous commits of the PR that added this code

…nc full sync loop

After snap sync completes, the node switches to full sync mode. Full sync
downloads headers backward from the tip looking for a block that is_canonical_sync
recognizes. This check requires entries in CANONICAL_BLOCK_HASHES (number→hash).

The background header download task stored headers via add_block_headers, which
only wrote to HEADERS and BLOCK_NUMBERS but not CANONICAL_BLOCK_HASHES. This
caused full sync to never find a canonical ancestor, looping endlessly with
"Sync failed to find target block header" until the 8h timeout.

Fix: add CANONICAL_BLOCK_HASHES write to the existing transaction in
add_block_headers (which has exactly one caller: snap sync background download).
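A toy model of the invariant being fixed (the table names follow the commit message, but this mock store is not ethrex's API): unless the same write also records number → hash, a canonicity check keyed by block number can never succeed.

use std::collections::BTreeMap;

type H256 = [u8; 32]; // stand-in hash type

#[derive(Default)]
struct MockStore {
    headers: BTreeMap<H256, u64>,                // HEADERS (header body elided)
    block_numbers: BTreeMap<H256, u64>,          // BLOCK_NUMBERS
    canonical_block_hashes: BTreeMap<u64, H256>, // CANONICAL_BLOCK_HASHES
}

impl MockStore {
    fn add_block_headers(&mut self, headers: &[(u64, H256)]) {
        for &(number, hash) in headers {
            self.headers.insert(hash, number);
            self.block_numbers.insert(hash, number);
            // The fix: record number -> hash in the same write, so the
            // full-sync canonicity lookup below can find an ancestor.
            self.canonical_block_hashes.insert(number, hash);
        }
    }

    fn is_canonical(&self, number: u64, hash: H256) -> bool {
        self.canonical_block_hashes.get(&number) == Some(&hash)
    }
}

fn main() {
    let mut store = MockStore::default();
    store.add_block_headers(&[(100, [1u8; 32])]);
    assert!(store.is_canonical(100, [1u8; 32]));
}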
@pablodeymo
Contributor Author

Attaching the investigation of the bug:
BACKGROUND_HEADER_DOWNLOAD_INVESTIGATION.md

…ng bug

The block_hashes field stored hashes without block numbers, so the
numbers_and_hashes construction for forkchoice_update inferred block
numbers from position (pivot_header.number - i). When background header
download and update_pivot interleaved inserts, entries ended up out of
order, causing forkchoice_update to write wrong canonical hashes.

BTreeMap keyed by block number naturally handles ordering and deduplicates
overlapping ranges from update_pivot re-inserting the same block numbers.
}

snap_sync(peers, &store, &mut block_sync_state, datadir).await?;
download_complete.store(true, Ordering::Release);

Contributor

Bug: download_complete is only set to true in the happy path (here) and in two specific early-returns (max attempts, full-sync switch). But if the function returns via ? error propagation (e.g., peers.request_block_headers(...) fails, store.add_block_headers(...) fails, etc.), download_complete is never set to true.

When this happens, the sender half of the channel is dropped (task exits), but snap_sync()'s initial-headers wait loop (line ~435) checks download_status().is_terminal() which returns false (flag still false), so it keeps calling recv_headers_timeout() → gets None (closed channel) → loops with 100ms sleeps forever.

Suggestion: use an RAII guard pattern to ensure download_complete is always set on exit:

struct DownloadCompleteGuard(Arc<AtomicBool>);
impl Drop for DownloadCompleteGuard {
    fn drop(&mut self) {
        self.0.store(true, Ordering::Release);
    }
}
// At the start of download_headers_background:
let _guard = DownloadCompleteGuard(download_complete.clone());

Then remove the manual download_complete.store(true, ...) calls.

);
// We can't easily go back in the background task, so we just continue with the current head
// The update_pivot mechanism will handle this case
tokio::time::sleep(Duration::from_millis(100)).await;

Contributor

The original code handled the reorg case by walking backward: current_head = first_block_parent_hash. The background task can't do this because current_head was removed in favor of current_head_number, and going back requires knowing the parent hash.

But the current code just sleeps 100ms and retries with the same current_head_number, which will get the same headers back and loop indefinitely. The comment says "The update_pivot mechanism will handle this case" but update_pivot runs in snap_sync() on the main task — it doesn't feed back into the background task's current_head_number.

Consider either:

  1. Decrement current_head_number by 1 to walk backward (simpler, analogous to the old parent_hash approach)
  2. Add a maximum retry count for this specific case
  3. Use _first_block_parent_hash (currently prefixed with _) to look up the parent's block number

    .process_incoming_headers(block_headers_iter)
    .await?;
// Send headers through channel (skip the first as we already have it)
if block_headers.len() > 1 && header_sender.send(block_headers).await.is_err() {

Contributor

The first header is skipped by the receiver in process_pending_headers() (via .skip(1)), but here the full batch including the first header is sent. This means every batch sent through the channel carries one header that will be discarded. Not a bug, but it's a subtle contract — the sender and receiver must agree on the "skip first" convention.

More importantly: the original code only skipped the first header of the first batch (to avoid the already-known current head). Here, .skip(1) is applied to every batch received from the channel. If the background task sends multiple batches, the first header of each batch (which is the continuation point from the previous batch's last header) gets silently dropped. This could cause gaps in block_hashes — missing one header per batch.

Double-check that the header fetching always returns overlapping ranges (i.e., the first header of batch N+1 == the last header of batch N). If so, the skip is correct. If not, this introduces gaps.

block_sync_state.set_header_channel(header_receiver, download_complete.clone());

// Create Arc wrapper for snap_enabled so we can share it with the background task
let snap_enabled_arc = Arc::new(AtomicBool::new(snap_enabled.load(Ordering::Relaxed)));

Contributor

nit: Creating a separate Arc<AtomicBool> copy of snap_enabled means the Syncer's original snap_enabled (shared with SyncManager) doesn't see the full-sync switch until snap_sync() completes and the sync-back on line ~389 runs. If SyncManager or other code checks snap_enabled concurrently during the sync cycle, it'll still see true even though the background task decided to switch to full sync.

In practice this is probably fine since the sync cycle is the only writer, but it's worth a comment explaining why a copy is used instead of sharing the original Arc.

@github-project-automation github-project-automation bot moved this to In Progress in ethrex_l1 Feb 11, 2026

Labels

L1 Ethereum client performance Block execution throughput and performance in general snapsync

Projects

Status: In Progress
Status: Todo

Development

Successfully merging this pull request may close these issues.

2 participants