chore(l1): replace unjustified panics with proper error propagation #6147
base: main
Conversation
🤖 Kimi Code Review

**Review Summary**

This PR improves error handling across several components by replacing panics with proper error propagation. The changes are generally positive, but there are some issues to address.

**Critical Issues**
**Medium Issues**
**Minor Issues**
**Positive Changes**

The PR is moving in the right direction by eliminating panics, but needs to address the thread spawn error handling to be fully correct.

Automated review by Kimi (Moonshot AI) · custom prompt
Greptile Overview

**Greptile Summary**

Confidence Score: 4/5
| Filename | Overview |
|---|---|
| crates/blockchain/blockchain.rs | Replaced thread spawn panics with error propagation in block execution pipeline; behavior looks correct, but thread panic join paths still convert to custom errors. |
| crates/blockchain/dev/block_producer.rs | Replaced payload_id unwrap panic with retry+log; control flow remains consistent with existing retry handling. |
| crates/common/types/block_execution_witness.rs | Converted trie insert/get/remove and decode panics into proper error propagation via GuestProgramStateError; functional behavior preserved. |
| crates/networking/p2p/network.rs | Replaced UDP bind expect with ? and added NetworkError variant to propagate bind failures cleanly. |
| crates/networking/p2p/peer_handler.rs | Replaced state trie access expects and a panic in storage range handling with PeerHandlerError returns; potential for new hard error path if invariants break. |
| crates/networking/p2p/sync/state_healing.rs | Replaced multiple panics in healing DB writes and membatch parent lookup with error propagation; new error path may abort healing on malformed/missing parent paths. |
| crates/networking/p2p/sync/storage_healing.rs | Replaced panic on oversized peer response with warn+failure scoring+requeue; avoids crashing but changes how bad peers are handled. |
| crates/storage/store.rs | Replaced genesis JSON deserialization expect with StoreError propagation, preventing crash on malformed genesis files. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant BC as Blockchain::execute_block_pipeline
    participant Warm as warmer thread
    participant Exec as execution thread
    participant Merk as merkleizer thread
    participant Store as Store/TrieDB
    BC->>Warm: spawn_scoped(warm_block)
    BC->>Exec: spawn_scoped(execute_block_pipeline)
    Exec-->>BC: join() -> Result<(exec_result, t_exec_end), ChainError>
    BC->>Merk: spawn_scoped(handle_merkleization)
    Merk-->>BC: join() -> Result<(updates_list, acc_updates, t_merk_end), StoreError>
    Exec->>Store: read/write via caching store
    Merk->>Store: apply merkleization updates
    BC-->>BC: validate_gas/receipts/requests
    BC-->>BC: propagate spawn/join errors via ChainError/StoreError
```
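As an illustration of the join-error handling the diagram refers to, here is a minimal, self-contained sketch, not ethrex code: `ChainError`, `run_pipeline`, and the gas arithmetic are stand-ins. It shows a scoped thread whose panic or failure is converted into an error and bubbled up with `?` instead of being unwrapped:

```rust
use std::thread;

// Hypothetical error type standing in for ethrex's ChainError; only for this sketch.
#[derive(Debug)]
enum ChainError {
    ExecutionPanicked,
    Execution(String),
}

// Sketch of the pattern: join() failures become errors instead of panics.
fn run_pipeline(block_number: u64) -> Result<u64, ChainError> {
    thread::scope(|s| {
        // Stand-in for the execution thread; the real pipeline executes the block here.
        let exec = s.spawn(move || -> Result<u64, ChainError> {
            if block_number == 0 {
                return Err(ChainError::Execution("cannot execute genesis".into()));
            }
            Ok(block_number * 21_000) // pretend gas used
        });

        // join() returns Err only if the thread panicked; map that to a ChainError
        // instead of calling .unwrap()/.expect(), then propagate the inner result too.
        let exec_result = exec.join().map_err(|_| ChainError::ExecutionPanicked)??;
        Ok(exec_result)
    })
}

fn main() {
    match run_pipeline(3) {
        Ok(gas) => println!("pipeline finished, gas used: {gas}"),
        Err(e) => eprintln!("pipeline failed: {e:?}"),
    }
}
```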
🤖 Codex Code Review

Looks good overall. No correctness or consensus-critical issues spotted in this diff. A few minor suggestions for robustness/maintainability:

If you want, I can also skim the full file context for any subtle EVM/state-trie interactions.

Automated review by OpenAI Codex · custom prompt
8 files reviewed, 3 comments
```diff
 if request.requests.len() < nodes_size {
-    panic!("The node responded with more data than us!");
+    warn!(
+        peer = ?request.peer_id,
+        requested = request.requests.len(),
+        received = nodes_size,
+        "Peer responded with more trie nodes than requested"
+    );
+    *failed_downloads += 1;
+    peer_handler
+        .peer_table
+        .record_failure(&request.peer_id)
+        .await?;
+    download_queue.extend(request.requests);
+    return Ok(None);
```
**Drops extra peer data** (crates/networking/p2p/sync/storage_healing.rs, lines 443–456)

In `zip_requeue_node_responses_score_peer`, when `nodes_size > request.requests.len()` you treat it as a peer failure and requeue the original requests, but you silently drop the *extra* trie nodes that were already received (`trie_nodes.nodes`), even though they may contain useful nodes for other pending requests. If this situation can occur due to a peer bug/misbehavior (or message framing issues), this code will now reliably discard those bytes and redo requests, potentially causing repeated download loops.

If you want to ignore the response, consider explicitly documenting that extra nodes are discarded by design (and ensure the peer is disconnected/penalized enough to prevent repeated churn), or alternatively keep the first `request.requests.len()` nodes and process those while penalizing the peer for the overflow.
The behavior goes against spec; a node exhibiting this behavior is malfunctioning.
The peer is in fact penalized.
```rust
let mut membatch_entry = membatch.remove(parent_path).ok_or(SyncError::CorruptPath)?;

membatch_entry.children_not_in_storage_count -= 1;
```
**New hard error path** (crates/networking/p2p/sync/state_healing.rs, lines 397–399)

`commit_node` now returns `Err(SyncError::CorruptPath)` if `membatch.remove(parent_path)` is missing. Previously this was a `panic!`, but now the error will propagate and abort `heal_state_trie` (via `heal_state_batch`), potentially stopping healing for the entire cycle. If `membatch` can be cleared/partially persisted between iterations (e.g., after `nodes_to_write` flushes) while child nodes are still being committed, this can become a reachable failure mode and regress healing progress from “crash loud” to “silently give up healing”. Consider confirming (or asserting) the invariant that every committed child’s `parent_path` must always be present in `membatch` at that point, or handle the missing-parent case by re-queuing/recovering rather than returning a fatal error.
The invariant is correct.
Additional Comments (1)
**Behavior change to unrecoverable** (crates/networking/p2p/peer_handler.rs, lines 1329–1339)

In the storage-range processing branch, the previous `panic!("Should have found the account hash")` is now an `UnrecoverableError` return. Because this path sits inside the request loop, hitting this condition will now terminate `request_storage_ranges` entirely (rather than crash), which may leave sync stuck without an obvious retry path. If this condition is realistically reachable due to inconsistent `accounts_by_root_hash` vs `accounts_with_storage_root` contents (e.g., data races between healing and range requests, or partial updates), it should likely be treated as a recoverable error that requeues the task instead of bailing out of storage-range downloading.

(Also appears in the earlier state-trie lookup conversion in this function: missing account now yields `UnrecoverableError`, which will similarly abort the whole request.)
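If that condition is reachable, one way to keep it recoverable is to requeue the task with a retry budget instead of returning `UnrecoverableError`. A minimal, self-contained sketch of that idea; every name here is an illustrative stand-in, not an actual `peer_handler.rs` identifier:

```rust
use std::collections::{HashMap, VecDeque};

// Illustrative stand-in for a storage-range task keyed by a storage root.
struct RangeTask {
    storage_root: u64,
}

fn process_tasks(accounts_by_root_hash: &HashMap<u64, String>, mut queue: VecDeque<RangeTask>) {
    let mut retries = 0;
    while let Some(task) = queue.pop_front() {
        // Recoverable handling: a missing account is logged and the task requeued
        // (with a retry budget), rather than aborting the whole download loop.
        let Some(account) = accounts_by_root_hash.get(&task.storage_root) else {
            retries += 1;
            if retries <= 3 {
                eprintln!("account for root {} not found, requeueing", task.storage_root);
                queue.push_back(task);
                continue;
            }
            eprintln!("giving up on root {}", task.storage_root);
            continue;
        };
        println!("processing storage range for account {account}");
    }
}

fn main() {
    let accounts = HashMap::from([(1u64, "alice".to_string())]);
    let queue = VecDeque::from([RangeTask { storage_root: 1 }, RangeTask { storage_root: 2 }]);
    process_tasks(&accounts, queue);
}
```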
```rust
let payload_id = match fork_choice_response.payload_id {
    Some(id) => id,
    None => {
        tracing::error!(
            "Failed to produce block: payload_id is None in ForkChoiceResponse"
        );
        sleep(Duration::from_millis(300)).await;
        tries += 1;
        continue;
    }
};
```
A `let Some(...) = ... else` pattern could be used here.
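For reference, a minimal sketch of that suggestion applied to the snippet above; it is a fragment meant to slot into the retry loop shown there, with the same behavior:

```rust
// Same control flow as the match: log, back off, bump the retry counter, and retry.
let Some(payload_id) = fork_choice_response.payload_id else {
    tracing::error!("Failed to produce block: payload_id is None in ForkChoiceResponse");
    sleep(Duration::from_millis(300)).await;
    tries += 1;
    continue;
};
```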
Improved in 5511f8c
# Conflicts: # crates/networking/p2p/peer_handler.rs # crates/networking/p2p/sync/healing/state.rs
# Conflicts: # crates/blockchain/blockchain.rs
| #[error("Failed to start Tx Broadcaster: {0}")] | ||
| TxBroadcasterError(#[from] TxBroadcasterError), | ||
| #[error("Failed to bind UDP socket: {0}")] | ||
| UdpSocketError(#[from] std::io::Error), |
nit: Adding a blanket `From<std::io::Error>` to `NetworkError` means any `io::Error` from any source in a function returning `Result<_, NetworkError>` will silently become `UdpSocketError`. Currently `start_network` only has the one UDP bind site, but this could be misleading if the function grows. A scoped `.map_err()` at the call site would be more precise:

```rust
let udp_socket = UdpSocket::bind(context.local_node.udp_addr())
    .await
    .map_err(|e| NetworkError::UdpSocketError(e))?;
```

and drop the `#[from]` on the variant.
```rust
});
let mut healing_queue_entry = healing_queue
    .remove(parent_path)
    .ok_or(SyncError::CorruptPath)?;
```
`SyncError::CorruptPath` is semantically wrong here: that variant was introduced for filesystem path failures (`create_dir_all`, `DirEntry`), not for a missing parent in the healing queue. More importantly, the old panic message included `parent_path` and `path`, which are very useful for debugging sync issues; this replacement loses that context entirely.

Consider a dedicated variant like `SyncError::HealingQueueInconsistency(String)`, or at least something that carries the parent/child path info, e.g.:

```rust
let mut healing_queue_entry = healing_queue.remove(parent_path).ok_or_else(|| {
    SyncError::Custom(format!(
        "Parent not found in healing queue. Parent: {parent_path:?}, path: {path:?}"
    ))
})?;
```

(assuming a `Custom(String)` variant exists or is added).
Motivation
We don't want panics in production code, since they can ungracefully crash the node.
Description
Removes some panics.
This makes progress towards goal 1.1 of the UX/DevEx roadmap
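As a closing illustration of the pattern this PR applies, here is a hypothetical sketch: `Genesis`, `StoreError`, and the variant name are stand-ins, not the real types in `crates/storage/store.rs`. It shows an `.expect()` on genesis deserialization replaced by a dedicated error variant plus `?`:

```rust
use serde::Deserialize;
use thiserror::Error;

// Hypothetical stand-ins for ethrex's genesis type and StoreError variants.
#[derive(Deserialize)]
struct Genesis {
    chain_id: u64,
}

#[derive(Debug, Error)]
enum StoreError {
    #[error("Failed to deserialize genesis file: {0}")]
    GenesisDeserialization(#[from] serde_json::Error),
}

// Before: serde_json::from_str(json).expect("malformed genesis") would crash the node.
// After: the error is propagated to the caller, which can report it and shut down cleanly.
fn load_genesis(json: &str) -> Result<Genesis, StoreError> {
    Ok(serde_json::from_str(json)?)
}

fn main() {
    match load_genesis(r#"{"chain_id": 1}"#) {
        Ok(genesis) => println!("loaded genesis for chain {}", genesis.chain_id),
        Err(e) => eprintln!("could not load genesis: {e}"),
    }
}
```

The caller then decides whether to retry, report, or shut down, instead of the process aborting inside the storage layer.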