
Conversation

Contributor

@fedacking fedacking commented Feb 10, 2026

Summary

  • Replace AccountStorageRoots with StorageTrieTracker throughout snap sync, eliminating index-based referencing and the accounts_by_root_hash intermediate structure
  • Introduce StorageTask / StorageTaskResult enums that move trie data into tasks and back in results, removing clones and simplifying the download loop
  • Use structured concurrency (JoinSet + try_join_next) instead of channels for worker communication in request_storage_ranges
  • Extract BigTrie::compute_intervals helper from the inline chunking logic

The full plan for this PR is documented in #6170

Test plan

  • cargo check -p ethrex-p2p compiles cleanly (default + rocksdb features)
  • cargo clippy -p ethrex-p2p passes with no warnings (default + rocksdb features)
  • cargo test -p ethrex-p2p — all 38 tests pass
  • Full snap sync test against a testnet peer (manual verification)


@fedacking fedacking requested a review from a team as a code owner February 10, 2026 19:47
@github-actions github-actions bot added L1 Ethereum client performance Block execution throughput and performance in general labels Feb 10, 2026
@github-actions

🤖 Kimi Code Review

Review Summary

This PR refactors the snap sync storage download logic by introducing a new StorageTrieTracker structure to replace the previous AccountStorageRoots. The changes improve code organization and separate concerns between small and big storage tries.

Issues Found

1. Potential Panic in handle_small_batch (Line 764-765)

let (big_root, big_trie) = tries
    .pop()
    .expect("tries should not be empty after split_off");
  • Issue: The expect could panic if tries is empty after split_off. While the comment suggests this shouldn't happen, there's no guarantee.
  • Fix: Use proper error handling instead of expect.
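As a minimal sketch, a let-else could replace the expect; this reuses the StorageTaskResult::SmallFailed variant that appears elsewhere in this diff, and whether that is the right fallback is an assumption:

// Sketch: return a retryable failure instead of panicking if the
// post-split_off invariant is ever violated.
let Some((big_root, big_trie)) = tries.pop() else {
    return StorageTaskResult::SmallFailed { tries, peer_id };
};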

2. Race Condition in StorageTrieTracker::handle_healed_account (Lines 245-290)

  • Issue: When handling healed accounts, the function performs multiple operations on big_tries and small_tries without proper synchronization. While the current usage appears single-threaded, this could become problematic if the sync process becomes multi-threaded in the future.
  • Fix: Consider adding synchronization primitives or documenting thread-safety assumptions.

3. Inefficient Interval Calculation (Lines 350-387)

let storage_density = start_hash_u256 / slot_count;
let chunk_size = storage_density
    .checked_mul(U256::from(slots_per_chunk))
    .unwrap_or(U256::MAX);
  • Issue: The interval calculation relies on integer division of U256 values, which can produce very small chunk sizes when slot_count is large, potentially creating an excessive number of intervals.
  • Fix: Add bounds checking and consider a minimum chunk size.
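A hedged sketch of such a floor, reusing the variables from the snippet above (the constant name and value are illustrative, not from the PR):

// Sketch: clamp the chunk size from below so a dense trie cannot explode
// into an excessive number of intervals, and chunk_size can never be zero.
const MIN_CHUNK_SIZE: u64 = 1 << 16; // hypothetical floor
let chunk_size = storage_density
    .checked_mul(U256::from(slots_per_chunk))
    .unwrap_or(U256::MAX)
    .max(U256::from(MIN_CHUNK_SIZE));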

4. Missing Validation in StorageTrieTracker::return_small_tries (Lines 215-229)

  • Issue: When returning failed tries, the function blindly extends accounts without checking for duplicates or validating the trie structure.
  • Fix: Consider deduplication or validation of the returned tries.
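One possible dedup sketch, assuming the accounts are H256 hashes as elsewhere in this diff; existing_accounts and returned_accounts are placeholder names for mutable Vec<H256> bindings:

use std::collections::HashSet;

// Sketch: only extend with account hashes that are not already tracked.
let mut seen: HashSet<H256> = existing_accounts.iter().copied().collect();
returned_accounts.retain(|account| seen.insert(*account));
existing_accounts.extend(returned_accounts);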

5. Incomplete Error Handling in request_storage_ranges_worker (Lines 1098-1357)

  • Issue: The worker functions return StorageTaskResult variants but don't properly propagate all error conditions, particularly network timeouts or invalid responses.
  • Fix: Ensure all error paths are properly handled and logged.

Positive Aspects

  1. Better Separation of Concerns: The new StorageTrieTracker structure clearly separates small and big trie handling.
  2. Improved Type Safety: Using enums (StorageTask, StorageTaskResult) instead of tuples improves code clarity.
  3. Reduced Complexity: The main request_storage_ranges function is significantly simplified.
  4. Better Metrics: The new structure allows for more accurate progress tracking.

Minor Suggestions

  1. Documentation: Add more documentation to the new StorageTrieTracker methods, especially around the healing process.
  2. Constants: Consider making the 10_000 slots per chunk in BigTrie::compute_intervals a named constant.
  3. Testing: The new structure would benefit from unit tests for the interval calculation and account management logic.

Overall, this is a solid refactoring that improves the codebase structure, though a few edge cases need attention.


Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions

🤖 Claude Code Review

Based on my review of PR #6171, here's my code review:

Code Review - PR #6171: Refactor storage download to use StorageTrieTracker

Overall Assessment

This is a substantial refactoring that improves code organization and eliminates clones. The changes move from index-based referencing to a more structured approach using enums and dedicated tracker types. The refactoring is well-conceived, but there are several issues that need attention.

Critical Issues

1. Potential Panic in handle_healed_account (sync.rs:298, 304)

let big = self
    .big_tries
    .get_mut(&old_root)
    .expect("big_tries should contain old_root");

The code uses .expect() after checking contains_key, but there's a TOCTOU (Time-of-check to Time-of-use) issue if this code becomes concurrent. More importantly, the logic assumes in_big guarantees the key exists, but if the check at line 291 returns true and then the entry is somehow removed before line 295, this will panic.

Recommendation: Replace the two .expect() calls with proper error handling or use if let Some(big) = ... pattern.
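A sketch of that pattern (assuming a tracing-style warn! macro is in scope; the log message is illustrative):

// Sketch: degrade gracefully instead of panicking when the entry is missing.
if let Some(big) = self.big_tries.get_mut(&old_root) {
    // ... migrate the healed account as before ...
} else {
    warn!("handle_healed_account: big_tries has no entry for {old_root:?}");
}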

2. Division by Zero Risk in compute_intervals (sync.rs:372)

let storage_density = start_hash_u256 / slot_count;

While slot_count is set to .max(1) on line 371, if start_hash_u256 is zero and slot_count is 1, you get storage_density = 0, which could lead to issues. More critically, the division start_hash_u256 / slot_count when start_hash_u256 < slot_count will result in zero, making chunk_size = 0 and potentially causing infinite loops or incorrect interval calculations.

Recommendation: Add validation that start_hash_u256 > 0 or handle the zero case explicitly. Consider using checked_div and proper error handling.

3. Integer Overflow in Interval Calculation (sync.rs:380)

let interval_start_u256 = start_hash_u256 + chunk_size * i;

The multiplication chunk_size * i could overflow before the addition. While U256 has a large range, this should use checked_mul and checked_add for safety in blockchain code.

Recommendation: Use checked arithmetic:

let offset = chunk_size.checked_mul(U256::from(i)).unwrap_or(U256::MAX);
let interval_start_u256 = start_hash_u256.checked_add(offset).unwrap_or(U256::MAX);

4. Unused Return Value from return_small_tries Method (sync.rs:263)

The method return_small_tries is defined but never called in the diff. This suggests incomplete implementation.

Recommendation: Either remove if unused or document why it's provided for future use.

Major Issues

5. Cloning in Promoted Big Trie Logic (client.rs:529, 561)

let intervals = BigTrie::compute_intervals(last_hash, slot_count, 10_000);
// ...
let accounts = tracker
    .big_tries
    .get(&big_root)
    .map(|b| b.accounts.clone())  // Clone here
    .unwrap_or_default();
for interval in intervals {
    tasks_queue_not_started.push_back(StorageTask::BigInterval {
        root: big_root,
        accounts: accounts.clone(),  // Clone for every interval

The PR claims to eliminate clones, but there are still significant clones happening when promoting to big tries and creating BigInterval tasks. The accounts vector is cloned for every interval.

Recommendation: Consider using Arc<Vec<H256>> for accounts if they're shared across multiple tasks.
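A sketch of that change; it assumes the accounts field of StorageTask::BigInterval is switched to Arc<Vec<H256>>, and the interval field name is a guess since the diff truncates the variant:

use std::sync::Arc;

// Sketch: share one allocation across all interval tasks for this root.
let accounts: Arc<Vec<H256>> = Arc::new(accounts);
for interval in intervals {
    tasks_queue_not_started.push_back(StorageTask::BigInterval {
        root: big_root,
        accounts: Arc::clone(&accounts), // pointer bump instead of a Vec copy
        interval,
    });
}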

6. Inconsistent Error Handling in Workers (client.rs:747, 773)

The worker functions return StorageTaskResult enum variants instead of using Result types. Failures return "failed" variants that are processed as normal results. This makes it harder to distinguish between network errors, validation errors, and actual failures.

Recommendation: Consider using Result<StorageTaskResult, WorkerError> for clearer error semantics.
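A sketch of the shape this could take (all names are illustrative, not from the PR):

// Sketch: keep transport and validation failures out of the domain-level
// result enum instead of folding them into "failed" variants.
#[derive(Debug)]
enum WorkerError {
    Network(String),
    Validation(String),
}

// The workers would then return Result<StorageTaskResult, WorkerError>,
// reserving Ok for domain outcomes (completed / promoted / retry).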

7. Missing Validation in handle_small_batch (client.rs:832)

if (slots.is_empty() && proof.is_empty()) || slots.is_empty() || slots.len() > tries.len() {
    return StorageTaskResult::SmallFailed { tries, peer_id };
}

The condition slots.is_empty() is checked twice (redundant). More importantly, there's no upper bound check on individual slot counts per trie.

Recommendation: Simplify the condition and add per-trie validation.
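Purely as boolean simplification, the (slots.is_empty() && proof.is_empty()) clause is subsumed by the standalone slots.is_empty() check, so a behavior-preserving form is the sketch below; whether an empty proof should also be rejected is a separate decision:

// Behavior-preserving reduction of the original condition.
if slots.is_empty() || slots.len() > tries.len() {
    return StorageTaskResult::SmallFailed { tries, peer_id };
}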

Minor Issues

8. Magic Number (client.rs:529)

let intervals = BigTrie::compute_intervals(last_hash, slot_count, 10_000);

The value 10_000 (slots_per_chunk) is hardcoded here but should be a constant like SLOTS_PER_CHUNK for consistency with other constants in the codebase.

9. Potential Empty Accounts Vector (client.rs:555-557)

let accounts = tracker
    .big_tries
    .get(&big_root)
    .map(|b| b.accounts.clone())
    .unwrap_or_default();

If this returns an empty vector and we proceed to create BigInterval tasks, those tasks will have empty account lists, which could cause issues downstream.

Recommendation: Add a check and handle the empty case explicitly.
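A minimal sketch of such a check, assuming it runs inside the result-processing loop (so continue applies) and that a tracing-style warn! is in scope:

// Sketch: make the empty-accounts case explicit before queueing BigInterval tasks.
if accounts.is_empty() {
    warn!("promoted big trie {big_root:?} has no associated accounts; skipping interval scheduling");
    continue;
}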

10. Inconsistent Peer Failure Recording (client.rs:608-612)

if slots.is_empty() {
    peers.peer_table.record_failure(&peer_id).await?;
} else {
    peers.peer_table.record_success(&peer_id).await?;
}

This is inside BigIntervalResult, but empty slots could be legitimate (the interval was already downloaded). The logic might incorrectly penalize peers.

Recommendation: Review whether empty slots is truly a failure condition for big intervals.

11. Missing Bounds Check (client.rs:920)

tries[i].1.slots = slot_data;

While the loop bounds should prevent out-of-bounds access, there's no explicit validation that i < tries.len() before this assignment.

Recommendation: Add an assertion or bounds check for safety.

Positive Aspects

  1. Structured Concurrency: The move from channels to JoinSet is excellent for structured concurrency and proper task lifecycle management.

  2. Type Safety: The use of enums (StorageTask, StorageTaskResult) provides better type safety than the previous index-based approach.

  3. Separation of Concerns: Extracting handle_small_batch and handle_big_interval improves readability and testability.

  4. Clearer Intent: The StorageTrieTracker makes the relationship between accounts and storage roots more explicit.

Recommendations

  1. Fix the critical arithmetic and panic issues before merging
  2. Add comprehensive tests for edge cases (zero slots, overflow conditions, empty tries)
  3. Consider using Arc for shared account vectors to truly eliminate clones
  4. Document the state machine transitions (small → big trie promotion)
  5. Add assertions or validation at critical points to catch invariant violations early

Test Coverage

The PR mentions tests pass, but given the complexity of the state machine logic, I'd recommend adding specific tests for:

  • Empty tries and edge cases in promotion logic
  • Overflow conditions in compute_intervals (see the sketch after this list)
  • Concurrent modifications in handle_healed_account
  • Failed workers and retry logic
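As one concrete starting point, a unit test for the zero-density edge case might look like the sketch below; it follows the BigTrie::compute_intervals(last_hash, slot_count, 10_000) call shape shown earlier, and the argument types plus the non-empty-result assertion are assumptions:

#[test]
fn compute_intervals_handles_tiny_start_hash() {
    // Degenerate input: the start hash is numerically smaller than the slot
    // count, the case flagged above as a zero chunk size / division risk.
    let last_hash = H256::from_low_u64_be(1);
    let intervals = BigTrie::compute_intervals(last_hash, 1_000, 10_000);
    assert!(!intervals.is_empty(), "should still produce at least one interval");
}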

Automated review by Claude (Anthropic) · custom prompt

@github-actions

github-actions bot commented Feb 10, 2026

Lines of code report

Total lines added: 179
Total lines removed: 46
Total lines changed: 225

Detailed view
+------------------------------------------------+-------+------+
| File                                           | Lines | Diff |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/snap/client.rs    | 1148  | -34  |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync.rs           | 420   | +179 |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/snap_sync.rs | 933   | -12  |
+------------------------------------------------+-------+------+

@greptile-apps

greptile-apps bot commented Feb 10, 2026

Greptile Overview

Greptile Summary

This PR refactors the storage download subsystem to use a cleaner StorageTrieTracker architecture, replacing the complex index-based AccountStorageRoots approach.

Key improvements:

  • Introduced StorageTask/StorageTaskResult enums that move trie data into tasks and results, eliminating clones and the intermediate accounts_by_root_hash structure
  • Replaced channel-based worker communication with structured concurrency using JoinSet and try_join_next
  • Extracted BigTrie::compute_intervals as a standalone helper for better separation of concerns
  • Simplified healing integration with new handle_healed_account method that encapsulates root migration logic

Minor issue found:

  • Redundant slots.is_empty() check in client.rs:1166 (see inline comment)

The refactor maintains equivalent functionality while improving code clarity and reducing unnecessary data copies. Tests pass and the design aligns well with the documented plan in issue #6170.

Confidence Score: 4/5

  • This PR is safe to merge with only minor cleanup needed
  • The refactoring is well-structured and maintains functional equivalence with the previous implementation. All tests pass and the code follows good architectural patterns. Only one minor redundant condition was found, which doesn't affect correctness.
  • No files require special attention

Important Files Changed

Filename | Overview
crates/networking/p2p/snap/client.rs | Major refactor of storage download logic using new StorageTask/StorageTaskResult enums and JoinSet-based concurrency; replaced clone-heavy channel communication with structured data flow
crates/networking/p2p/sync.rs | Introduced new StorageTrieTracker with SmallTrie/BigTrie structures to replace index-based AccountStorageRoots; added interval computation helper for big tries

Sequence Diagram

sequenceDiagram
    participant Main as request_storage_ranges
    participant Tracker as StorageTrieTracker
    participant Queue as Task Queue
    participant Worker as Worker Tasks
    participant Peer as Peer Network

    Main->>Tracker: take_small_batch(STORAGE_BATCH_SIZE)
    Tracker-->>Main: SmallBatch tasks
    Main->>Queue: Queue SmallBatch tasks

    loop For each BigTrie
        Main->>Tracker: Get intervals from big_tries
        Main->>Queue: Queue BigInterval tasks
    end

    loop Until all tasks complete
        Main->>Worker: Spawn worker with task
        Worker->>Peer: GetStorageRanges request
        Peer-->>Worker: StorageRanges response
        
        alt SmallBatch completed
            Worker-->>Main: SmallComplete
            Main->>Main: Write to disk buffer
            Main->>Queue: Re-queue remaining tries
        else SmallBatch failed
            Worker-->>Main: SmallFailed
            Main->>Queue: Re-queue all tries
        else Small promoted to big
            Worker-->>Main: SmallPromotedToBig
            Main->>Tracker: promote_to_big()
            Main->>Main: Compute intervals
            Main->>Queue: Queue BigInterval tasks
        else BigInterval result
            Worker-->>Main: BigIntervalResult
            Main->>Main: Append slots to buffer
            Main->>Queue: Re-queue remaining interval
        end
    end

    Main->>Main: Flush disk buffers


@greptile-apps greptile-apps bot left a comment


2 files reviewed, 1 comment


Comment on crates/networking/p2p/snap/client.rs, line 1166:

tx.send(empty_task_result).await.ok();
return Ok(());

if (slots.is_empty() && proof.is_empty()) || slots.is_empty() || slots.len() > tries.len() {
redundant condition: slots.is_empty() is checked twice

Suggested change
if (slots.is_empty() && proof.is_empty()) || slots.is_empty() || slots.len() > tries.len() {
if slots.is_empty() || proof.is_empty() || slots.len() > tries.len() {

@github-actions

🤖 Codex Code Review

Findings

  • Critical – division by zero/panic in interval computation: BigTrie::compute_intervals can compute storage_density = 0 when last_downloaded_hash is small (e.g., 0 or < slot_count), which yields chunk_size = 0 and then divides by zero in missing_storage_range / chunk_size. This will panic and abort snap sync. Guard with chunk_size = max(1) or early-return a single [start..MAX] interval. crates/networking/p2p/sync.rs:369-376
  • High – stale storage data reused after root change: handle_healed_account re-keys/clones BigTrie slots/intervals when an account’s storage root changes. If old_root != new_root, those slots/intervals are no longer valid and can cause incorrect storage downloads/verification. New root should start empty (or be fully re-requested) rather than inheriting old data. crates/networking/p2p/sync.rs:293-316
  • High – loss of pending work on stale pivot: request_storage_ranges drains small_tries into local tasks via take_small_batch and never returns unfinished tasks to the tracker when the loop exits due to staleness. This loses remaining tries for subsequent attempts. Reinsert tasks_queue_not_started (and any unfinished/in-flight) into the tracker before returning on early break. crates/networking/p2p/snap/client.rs:545-566, crates/networking/p2p/snap/client.rs:829-832
  • Medium – busy spin when no tasks available: When tasks_queue_not_started is empty but completed_tasks < task_count, the loop continues without waiting, causing a tight spin. Add a small sleep or await worker_joinset.join_next() in that branch to avoid CPU burn and improve fairness. crates/networking/p2p/snap/client.rs:852-857

If you want, I can suggest concrete patches for the fixes above or add a small unit test for compute_intervals edge cases.


Automated review by OpenAI Codex · custom prompt


@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 3 potential issues.

View 6 additional findings in Devin Review.


Comment on crates/networking/p2p/sync.rs, lines +238 to +239
let small = self.small_tries.remove(&root);
let accounts = small.map(|s| s.accounts).unwrap_or_default();


🔴 promote_to_big creates BigTrie with empty accounts because SmallTrie was already drained from tracker

When a small trie is promoted to big, promote_to_big tries to recover accounts via self.small_tries.remove(&root), but the small trie was already removed from tracker.small_tries by take_small_batch() at the start of request_storage_ranges (lines 590-596). This causes the newly created BigTrie to have an empty accounts list.

Root Cause and Impact

The flow is:

  1. All small tries are drained from tracker.small_tries into StorageTask::SmallBatch tasks at crates/networking/p2p/snap/client.rs:590-596.
  2. A worker discovers a big trie and returns SmallPromotedToBig { big_root, big_trie, ... }.
  3. The handler calls tracker.promote_to_big(big_root, big_trie.slots, intervals.clone()) at line 741.
  4. Inside promote_to_big (crates/networking/p2p/sync.rs:238), self.small_tries.remove(&root) returns None.
  5. accounts becomes unwrap_or_default() → empty Vec.
  6. The BigTrie is stored with no accounts.

At lines 744-748, accounts is then fetched from tracker.big_tries.get(&big_root) and is empty. All subsequent BigInterval tasks carry empty accounts vectors.

When BigIntervalResult is processed (lines 782-788), if current_account_storages has been flushed to disk since the initial slots insertion (lines 630-668), a new AccountsWithStorage entry is created via or_insert_with with an empty accounts list. This means downloaded storage slots are written to disk without being associated with any accounts, causing data loss during storage trie insertion.

Impact: Big storage tries (accounts with large storage like DEX contracts) lose their account-to-storage association, potentially requiring full re-sync via the healing fallback.

Prompt for agents
In crates/networking/p2p/sync.rs, the promote_to_big method needs to accept an accounts list as a parameter instead of trying to recover it from small_tries (which has already been drained). Change the signature to: pub fn promote_to_big(&mut self, root: H256, accounts: Vec<H256>, first_slots: Vec<Slot>, intervals: Vec<Interval>). Use the passed-in accounts directly instead of self.small_tries.remove(&root). Then update the call site in crates/networking/p2p/snap/client.rs line 741 to pass big_trie.accounts (or a clone of it, since it is used earlier at lines 723 and 734) as the accounts parameter.
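A condensed sketch of that signature change; BigTrie's field names are assumed from the usage shown elsewhere in this PR, and the real body may need to merge with an existing entry:

// Sketch: take the accounts from the caller instead of consulting
// self.small_tries, which take_small_batch has already drained.
pub fn promote_to_big(
    &mut self,
    root: H256,
    accounts: Vec<H256>,
    first_slots: Vec<Slot>,
    intervals: Vec<Interval>,
) {
    self.big_tries
        .insert(root, BigTrie { accounts, slots: first_slots, intervals });
}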


Comment on crates/networking/p2p/sync.rs, lines +372 to +376
let storage_density = start_hash_u256 / slot_count;
let chunk_size = storage_density
    .checked_mul(U256::from(slots_per_chunk))
    .unwrap_or(U256::MAX);
let chunk_count = (missing_storage_range / chunk_size).as_usize().max(1);


🟡 Division by zero in compute_intervals when storage_density is zero

In BigTrie::compute_intervals, if start_hash_u256 / slot_count evaluates to 0 (integer division), then chunk_size becomes 0, and missing_storage_range / chunk_size on line 376 panics with a division-by-zero.

Detailed Explanation

The computation at crates/networking/p2p/sync.rs:372-376:

let storage_density = start_hash_u256 / slot_count;
let chunk_size = storage_density
    .checked_mul(U256::from(slots_per_chunk))
    .unwrap_or(U256::MAX);
let chunk_count = (missing_storage_range / chunk_size).as_usize().max(1);

If start_hash_u256 < slot_count (e.g., the hash space covered so far is numerically smaller than the number of slots downloaded), storage_density is 0. Then chunk_size = 0 * 10_000 = 0, and missing_storage_range / 0 panics.

While extremely unlikely with keccak256-derived hashes (which are uniformly distributed over 2^256), this is still a potential panic path that should be guarded against. A malicious or buggy peer could potentially trigger this.

Impact: Panic in a spawned async task, causing silent data loss (see BUG-0003) for the affected storage trie.

Suggested change (before):

let storage_density = start_hash_u256 / slot_count;
let chunk_size = storage_density
    .checked_mul(U256::from(slots_per_chunk))
    .unwrap_or(U256::MAX);
let chunk_count = (missing_storage_range / chunk_size).as_usize().max(1);

Suggested change (after):

let storage_density = start_hash_u256 / slot_count;
let chunk_size = storage_density
    .checked_mul(U256::from(slots_per_chunk))
    .unwrap_or(U256::MAX)
    .max(U256::one());
let chunk_count = (missing_storage_range / chunk_size).as_usize().max(1);


Comment on crates/networking/p2p/snap/client.rs, around line 671:

if old_intervals.is_empty() {
    accounts_done.insert(*account, vec![]);
}
if let Some(Ok(result)) = worker_joinset.try_join_next() {


🟡 Panicked worker tasks are silently ignored, causing data loss

At line 671, if let Some(Ok(result)) = worker_joinset.try_join_next() silently discards Some(Err(..)) results, which represent panicked tasks. The StorageTask data (tries or intervals) held by a panicked task is permanently lost.

Detailed Explanation

When a worker task panics (e.g., due to the division-by-zero in compute_intervals), try_join_next() returns Some(Err(JoinError)). The if let Some(Ok(result)) pattern silently ignores this. The storage data owned by the task (the SmallBatch tries or BigInterval data) is dropped.

The main loop at crates/networking/p2p/snap/client.rs:826-831 breaks when worker_joinset.is_empty(), so the loop will eventually terminate, but the data from panicked tasks is never re-queued or accounted for.

Impact: Any storage tries or intervals owned by a panicked task are silently lost, leading to incomplete storage downloads. This will eventually be caught by the healing phase, but it degrades sync performance.

Prompt for agents
In crates/networking/p2p/snap/client.rs line 671, change the pattern to handle Err results from try_join_next. When a task panics (Some(Err(join_error))), log a warning. Consider whether the task data can be recovered (it cannot from a JoinError, but you should at least log it). A minimal fix: change the if-let to match on Some(result) and handle the Err case with a warning log, e.g. Some(Err(e)) => { warn!("Worker task panicked: {e:?}"); }
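Expanded into the full match, a sketch over the loop shown above:

// Sketch: make panicked workers visible instead of silently dropping them.
match worker_joinset.try_join_next() {
    Some(Ok(result)) => {
        // existing result handling stays here
    }
    Some(Err(join_error)) => {
        // The task's StorageTask data cannot be recovered from a JoinError,
        // but the panic should at least be logged.
        warn!("storage worker task panicked: {join_error:?}");
    }
    None => {}
}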


@github-actions

github-actions bot commented Feb 10, 2026

Benchmark Block Execution Results Comparison Against Main

Command | Mean [s]       | Min [s] | Max [s] | Relative
base    | 64.072 ± 0.402 | 63.585  | 64.712  | 1.00
head    | 64.219 ± 0.325 | 63.385  | 64.549  | 1.00 ± 0.01

Contributor Author

fedacking commented Feb 10, 2026

One thing I think would be very useful is adding monitoring to request_storage_ranges so we can observe what's happening during the storage download phase. Right now it's hard to tell the state of progress and the mix of work being done.

Concretely, I'd like to see periodic debug! logs (or at the very least before/after each request_storage_ranges call) that include:

  1. Small vs big tries count — tracker.small_tries.len() and tracker.big_tries.len() separately. We already log tracker.remaining_count() in snap_sync.rs, but that collapses both into one number, which hides the distribution.
  2. Total intervals across big tries — the sum of intervals.len() for each entry in tracker.big_tries. This tells us how much sub-range work is still pending for the large tries.
  3. Number of small batch vs big interval requests sent — how many SmallBatch and BigInterval tasks were actually dispatched to workers during the call.

This would give us good visibility into whether storage download is making progress, whether tries are getting promoted from small to big, and how the interval-based download is evolving over successive attempts. The existing metrics infrastructure (METRICS) seems like the right place to wire these into.
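A rough sketch of what such a log could look like; the counters small_batches_sent / big_intervals_sent are hypothetical locals, and the structured-field syntax assumes the tracing debug! macro:

// Sketch: periodic visibility into the storage download mix.
let pending_intervals: usize = tracker
    .big_tries
    .values()
    .map(|big| big.intervals.len())
    .sum();
debug!(
    small_tries = tracker.small_tries.len(),
    big_tries = tracker.big_tries.len(),
    pending_intervals,
    small_batches_sent,
    big_intervals_sent,
    "storage download progress"
);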

