perf(l1): refactor storage download to use StorageTrieTracker #6171
base: main
Conversation
🤖 Kimi Code Review

Review Summary

This PR refactors the snap sync storage download logic by introducing a new …

Issues Found

1. Potential Panic in …
🤖 Claude Code Review

Code Review - PR #6171: Refactor storage download to use StorageTrieTracker

Overall Assessment

This is a substantial refactoring that improves code organization and eliminates clones. The changes move from index-based referencing to a more structured approach using enums and dedicated tracker types. The refactoring is well-conceived, but there are several issues that need attention.

Critical Issues

1. Potential Panic in …
Greptile Overview

Greptile Summary

This PR refactors the storage download subsystem to use a cleaner …

Key improvements: …

Minor issue found: …

The refactor maintains equivalent functionality while improving code clarity and reducing unnecessary data copies. Tests pass and the design aligns well with the documented plan in issue #6170.

Confidence Score: 4/5
| Filename | Overview |
|---|---|
| crates/networking/p2p/snap/client.rs | Major refactor of storage download logic using new StorageTask/StorageTaskResult enums and JoinSet-based concurrency; replaced clone-heavy channel communication with structured data flow |
| crates/networking/p2p/sync.rs | Introduced new StorageTrieTracker with SmallTrie/BigTrie structures to replace index-based AccountStorageRoots; added interval computation helper for big tries |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Main as request_storage_ranges
    participant Tracker as StorageTrieTracker
    participant Queue as Task Queue
    participant Worker as Worker Tasks
    participant Peer as Peer Network

    Main->>Tracker: take_small_batch(STORAGE_BATCH_SIZE)
    Tracker-->>Main: SmallBatch tasks
    Main->>Queue: Queue SmallBatch tasks
    loop For each BigTrie
        Main->>Tracker: Get intervals from big_tries
        Main->>Queue: Queue BigInterval tasks
    end
    loop Until all tasks complete
        Main->>Worker: Spawn worker with task
        Worker->>Peer: GetStorageRanges request
        Peer-->>Worker: StorageRanges response
        alt SmallBatch completed
            Worker-->>Main: SmallComplete
            Main->>Main: Write to disk buffer
            Main->>Queue: Re-queue remaining tries
        else SmallBatch failed
            Worker-->>Main: SmallFailed
            Main->>Queue: Re-queue all tries
        else Small promoted to big
            Worker-->>Main: SmallPromotedToBig
            Main->>Tracker: promote_to_big()
            Main->>Main: Compute intervals
            Main->>Queue: Queue BigInterval tasks
        else BigInterval result
            Worker-->>Main: BigIntervalResult
            Main->>Main: Append slots to buffer
            Main->>Queue: Re-queue remaining interval
        end
    end
    Main->>Main: Flush disk buffers
```
2 files reviewed, 1 comment
```rust
tx.send(empty_task_result).await.ok();
return Ok(());
// …
if (slots.is_empty() && proof.is_empty()) || slots.is_empty() || slots.len() > tries.len() {
```
redundant condition: `slots.is_empty()` is checked twice
```diff
-if (slots.is_empty() && proof.is_empty()) || slots.is_empty() || slots.len() > tries.len() {
+if slots.is_empty() || proof.is_empty() || slots.len() > tries.len() {
```
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/snap/client.rs
Line: 1166:1166
Comment:
redundant condition: `slots.is_empty()` is checked twice
```suggestion
if slots.is_empty() || proof.is_empty() || slots.len() > tries.len() {
```
How can I resolve this? If you propose a fix, please make it concise.
🤖 Codex Code Review

Findings

…

If you want, I can suggest concrete patches for the fixes above or add a small unit test for …

Automated review by OpenAI Codex · custom prompt
```rust
let small = self.small_tries.remove(&root);
let accounts = small.map(|s| s.accounts).unwrap_or_default();
```
🔴 promote_to_big creates BigTrie with empty accounts because SmallTrie was already drained from tracker
When a small trie is promoted to big, `promote_to_big` tries to recover accounts via `self.small_tries.remove(&root)`, but the small trie was already removed from `tracker.small_tries` by `take_small_batch()` at the start of `request_storage_ranges` (lines 590-596). This causes the newly created `BigTrie` to have an empty accounts list.
Root Cause and Impact
The flow is:
- All small tries are drained from `tracker.small_tries` into `StorageTask::SmallBatch` tasks at `crates/networking/p2p/snap/client.rs:590-596`.
- A worker discovers a big trie and returns `SmallPromotedToBig { big_root, big_trie, ... }`.
- The handler calls `tracker.promote_to_big(big_root, big_trie.slots, intervals.clone())` at line 741.
- Inside `promote_to_big` (`crates/networking/p2p/sync.rs:238`), `self.small_tries.remove(&root)` returns `None`.
- `accounts` becomes `unwrap_or_default()` → an empty `Vec`.
- The `BigTrie` is stored with no accounts.
At lines 744-748, `accounts` is then fetched from `tracker.big_tries.get(&big_root)` and is empty. All subsequent `BigInterval` tasks carry empty accounts vectors.

When `BigIntervalResult` is processed (lines 782-788), if `current_account_storages` has been flushed to disk since the initial slots insertion (lines 630-668), a new `AccountsWithStorage` entry is created via `or_insert_with` with an empty accounts list. This means downloaded storage slots are written to disk without being associated with any accounts, causing data loss during storage trie insertion.
Impact: Big storage tries (accounts with large storage like DEX contracts) lose their account-to-storage association, potentially requiring full re-sync via the healing fallback.
Prompt for agents
In `crates/networking/p2p/sync.rs`, the `promote_to_big` method needs to accept an accounts list as a parameter instead of trying to recover it from `small_tries` (which has already been drained). Change the signature to `pub fn promote_to_big(&mut self, root: H256, accounts: Vec<H256>, first_slots: Vec<Slot>, intervals: Vec<Interval>)`. Use the passed-in accounts directly instead of `self.small_tries.remove(&root)`. Then update the call site in `crates/networking/p2p/snap/client.rs` line 741 to pass `big_trie.accounts` (or a clone of it, since it is used earlier at lines 723 and 734) as the accounts parameter.
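A minimal sketch of that fix follows, with placeholder `Slot`, `Interval`, and `BigTrie` shapes standing in for the PR's real definitions:

```rust
use std::collections::HashMap;
use ethereum_types::{H256, U256};

// Placeholder shapes for illustration only; the PR's definitions differ.
type Slot = (H256, U256);
type Interval = (H256, H256);

struct BigTrie {
    accounts: Vec<H256>,
    slots: Vec<Slot>,
    intervals: Vec<Interval>,
}

struct StorageTrieTracker {
    big_tries: HashMap<H256, BigTrie>,
    // small_tries omitted: by the time promotion happens it has been drained.
}

impl StorageTrieTracker {
    /// Takes `accounts` as a parameter instead of trying to recover it from
    /// `small_tries`, which `take_small_batch()` already emptied.
    fn promote_to_big(
        &mut self,
        root: H256,
        accounts: Vec<H256>,
        first_slots: Vec<Slot>,
        intervals: Vec<Interval>,
    ) {
        self.big_tries
            .insert(root, BigTrie { accounts, slots: first_slots, intervals });
    }
}
```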
```rust
let storage_density = start_hash_u256 / slot_count;
let chunk_size = storage_density
    .checked_mul(U256::from(slots_per_chunk))
    .unwrap_or(U256::MAX);
let chunk_count = (missing_storage_range / chunk_size).as_usize().max(1);
```
🟡 Division by zero in compute_intervals when storage_density is zero
In `BigTrie::compute_intervals`, if `start_hash_u256 / slot_count` evaluates to 0 (integer division), then `chunk_size` becomes 0, and `missing_storage_range / chunk_size` on line 376 panics with a division-by-zero.
Detailed Explanation
The computation at `crates/networking/p2p/sync.rs:372-376`:
```rust
let storage_density = start_hash_u256 / slot_count;
let chunk_size = storage_density
    .checked_mul(U256::from(slots_per_chunk))
    .unwrap_or(U256::MAX);
let chunk_count = (missing_storage_range / chunk_size).as_usize().max(1);
```
If `start_hash_u256 < slot_count` (e.g., the hash space covered so far is numerically smaller than the number of slots downloaded), `storage_density` is 0. Then `chunk_size = 0 * 10_000 = 0`, and `missing_storage_range / 0` panics.
While extremely unlikely with keccak256-derived hashes (which are uniformly distributed over 2^256), this is still a potential panic path that should be guarded against. A malicious or buggy peer could potentially trigger this.
Impact: Panic in a spawned async task, causing silent data loss (see BUG-0003) for the affected storage trie.
```diff
 let storage_density = start_hash_u256 / slot_count;
 let chunk_size = storage_density
     .checked_mul(U256::from(slots_per_chunk))
-    .unwrap_or(U256::MAX);
+    .unwrap_or(U256::MAX)
+    .max(U256::one());
 let chunk_count = (missing_storage_range / chunk_size).as_usize().max(1);
```
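To make the degenerate case concrete, here is a self-contained reproduction of the guarded arithmetic. It assumes the `ethereum_types::U256` type; the input values are fabricated to force a zero density, as a buggy or malicious peer could:

```rust
use ethereum_types::U256;

fn main() {
    // Fabricated inputs: the covered hash range (5) is numerically smaller
    // than the slot count (10), so integer division rounds density to zero.
    let start_hash_u256 = U256::from(5u64);
    let slot_count = U256::from(10u64);
    let slots_per_chunk = 10_000u64;
    let missing_storage_range = U256::from(1_000_000u64);

    let storage_density = start_hash_u256 / slot_count; // 0
    let chunk_size = storage_density
        .checked_mul(U256::from(slots_per_chunk))
        .unwrap_or(U256::MAX)
        .max(U256::one()); // guard: without this, the division below panics
    let chunk_count = (missing_storage_range / chunk_size).as_usize().max(1);

    assert_eq!(chunk_count, 1_000_000);
}
```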
```rust
if old_intervals.is_empty() {
    accounts_done.insert(*account, vec![]);
}
// …
if let Some(Ok(result)) = worker_joinset.try_join_next() {
```
🟡 Panicked worker tasks are silently ignored, causing data loss
At line 671, `if let Some(Ok(result)) = worker_joinset.try_join_next()` silently discards `Some(Err(..))` results, which represent panicked tasks. The `StorageTask` data (tries or intervals) held by a panicked task is permanently lost.
Detailed Explanation
When a worker task panics (e.g., due to the division-by-zero in `compute_intervals`), `try_join_next()` returns `Some(Err(JoinError))`. The `if let Some(Ok(result))` pattern silently ignores this. The storage data owned by the task (the `SmallBatch` tries or `BigInterval` data) is dropped.

The main loop at `crates/networking/p2p/snap/client.rs:826-831` breaks when `worker_joinset.is_empty()`, so the loop will eventually terminate, but the data from panicked tasks is never re-queued or accounted for.
Impact: Any storage tries or intervals owned by a panicked task are silently lost, leading to incomplete storage downloads. This will eventually be caught by the healing phase, but it degrades sync performance.
Prompt for agents
In `crates/networking/p2p/snap/client.rs` line 671, change the pattern to handle `Err` results from `try_join_next`. When a task panics (`Some(Err(join_error))`), log a warning. Consider whether the task data can be recovered (it cannot from a `JoinError`, but you should at least log it). A minimal fix: change the if-let to match on `Some(result)` and handle the `Err` case with a warning log, e.g. `Some(Err(e)) => { warn!("Worker task panicked: {e:?}"); }`
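A sketch of that minimal fix, assuming tokio's `JoinSet`; `T` stands in for the PR's `StorageTaskResult` and the `handle` closure for its existing dispatch logic:

```rust
use tokio::task::JoinSet;
use tracing::warn;

// try_join_next() yields Some(Err(JoinError)) for panicked tasks; the
// current `if let Some(Ok(result))` pattern drops those silently.
fn drain_finished<T: 'static>(worker_joinset: &mut JoinSet<T>, mut handle: impl FnMut(T)) {
    while let Some(result) = worker_joinset.try_join_next() {
        match result {
            Ok(task_result) => handle(task_result),
            Err(join_error) => {
                // The task's StorageTask payload cannot be recovered from a
                // JoinError, but the panic should at least be surfaced.
                warn!("Worker task panicked: {join_error:?}");
            }
        }
    }
}
```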
Benchmark Block Execution Results Comparison Against Main
One thing I think would be very useful is adding monitoring to `request_storage_ranges` so we can observe what's happening during the storage download phase. Right now it's hard to tell the state of progress and the mix of work being done. Concretely, I'd like to see periodic `debug!` logs (or at the very least before/after each `request_storage_ranges` call) that include: …
This would give us good visibility into whether storage download is making progress, whether tries are getting promoted from small to big, and how the interval-based download is evolving over successive attempts. The existing metrics infrastructure (`METRICS`) seems like the right place to wire these into.
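As a starting point, a sketch of what such a periodic log could look like, assuming `tracing`; the `StorageProgress` struct and the ten-second cadence are illustrative, with the counters mirroring state the PR already tracks:

```rust
use std::time::{Duration, Instant};
use tracing::debug;

/// Hypothetical progress snapshot for request_storage_ranges; the counters
/// mirror state the PR already tracks (small/big tries, queue, workers).
struct StorageProgress {
    small_tries: usize,
    big_tries: usize,
    queued_tasks: usize,
    in_flight_workers: usize,
}

fn maybe_log_progress(last_log: &mut Instant, p: &StorageProgress) {
    if last_log.elapsed() >= Duration::from_secs(10) {
        debug!(
            small_tries = p.small_tries,
            big_tries = p.big_tries,
            queued_tasks = p.queued_tasks,
            in_flight = p.in_flight_workers,
            "storage download progress"
        );
        *last_log = Instant::now();
    }
}
```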
Summary
- Replaces `AccountStorageRoots` with `StorageTrieTracker` throughout snap sync, eliminating index-based referencing and the `accounts_by_root_hash` intermediate structure
- Introduces `StorageTask`/`StorageTaskResult` enums that move trie data into tasks and back in results, removing clones and simplifying the download loop
- Uses `JoinSet` + `try_join_next` instead of channels for worker communication in `request_storage_ranges`
- Extracts the `BigTrie::compute_intervals` helper from the inline chunking logic

The full plan for this PR is documented in #6170
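For context, a speculative sketch of the task/result shapes named above; the variant names follow the reviews and the sequence diagram, but the field shapes are assumptions, not the PR's definitions:

```rust
use ethereum_types::{H256, U256};

// Illustrative only; the exact definitions live in the PR.
type Slot = (H256, U256);     // assumed (hashed key, value) encoding
type Interval = (H256, H256); // assumed [start, end] hash range

enum StorageTask {
    /// A batch of small storage tries, each owning its account hashes.
    SmallBatch(Vec<(H256, Vec<H256>)>), // (storage_root, accounts)
    /// One hash-range interval of a big storage trie.
    BigInterval { root: H256, interval: Interval },
}

enum StorageTaskResult {
    SmallComplete(Vec<(H256, Vec<Slot>)>),
    SmallFailed(Vec<(H256, Vec<H256>)>),
    SmallPromotedToBig { big_root: H256, first_slots: Vec<Slot> },
    BigIntervalResult { root: H256, slots: Vec<Slot>, remaining: Option<Interval> },
}
```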
Test plan
- `cargo check -p ethrex-p2p` compiles cleanly (default + rocksdb features)
- `cargo clippy -p ethrex-p2p` passes with no warnings (default + rocksdb features)
- `cargo test -p ethrex-p2p` — all 38 tests pass