
Conversation

@iovoid
Contributor

@iovoid iovoid commented Feb 6, 2026

Motivation

We don't want panics in production code, since they can crash the node ungracefully.

Description

Removes some panics.

This makes progress towards goal 1.1 of the UX/DevEx roadmap.

@github-actions github-actions bot added the L1 Ethereum client label Feb 6, 2026
@github-actions

github-actions bot commented Feb 6, 2026

🤖 Kimi Code Review

Review Summary

This PR improves error handling across several components by replacing panics with proper error propagation. The changes are generally positive, but there are some issues to address.

Critical Issues

  1. Thread spawn error handling in blockchain.rs: Lines 378-382, 396-399, 416-419

    • The .map_err() calls convert thread spawn failures to ChainError, but the function already returns Result<_, ChainError>, so these errors end up double-wrapped.
    • Fix: Remove .map_err() and let the ? operator handle the conversion, or change the return type to handle the spawn errors directly (see the sketch after this list).
  2. Inconsistent error handling in blockchain.rs: The scope closure now returns Result<_, ChainError> but the pattern matching at the end (lines 440-450) still uses .unwrap_or_else() which could panic. These should use ? to propagate errors properly.
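
For the first item, a minimal sketch of how a scoped-thread join can propagate errors once via ? instead of double-wrapping; ChainError::ThreadPanicked and execute_block are placeholders here, not ethrex's real items:

use std::thread;

// Placeholder error type standing in for ethrex's ChainError.
#[derive(Debug)]
enum ChainError {
    ThreadPanicked,
}

// Stand-in for the real pipeline step; it returns the same Result shape.
fn execute_block() -> Result<(), ChainError> {
    Ok(())
}

fn run_pipeline() -> Result<(), ChainError> {
    thread::scope(|s| -> Result<(), ChainError> {
        let exec = s.spawn(execute_block);
        // join() errs only if the thread panicked; convert that once,
        // then `?` the inner Result instead of wrapping it a second time.
        exec.join().map_err(|_| ChainError::ThreadPanicked)??;
        Ok(())
    })
}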

Medium Issues

  1. Incomplete error propagation in state_healing.rs: Line 297

    • The comment says "avoid out of order deletes" but the code only handles one task completion. If multiple tasks complete, others are ignored.
    • Fix: Consider handling all completed tasks, not just one.
  2. Resource cleanup in state_healing.rs: Lines 321-325, 335-339

    • The join_all().await loops could be simplified to while let Some(result) = db_joinset.join_next().await { result??; } to ensure all tasks are properly handled (see the sketch below).
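
A self-contained sketch of that drain pattern, assuming a tokio JoinSet whose tasks return their own Result; the error types are stand-ins, not ethrex's:

use tokio::task::JoinSet;

async fn drain_db_writes(
    mut db_joinset: JoinSet<Result<(), std::io::Error>>,
) -> Result<(), Box<dyn std::error::Error>> {
    while let Some(result) = db_joinset.join_next().await {
        // The outer `?` surfaces JoinError (panic/cancellation),
        // the inner `?` surfaces the task's own write error.
        result??;
    }
    Ok(())
}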

Minor Issues

  1. Typo in state_healing.rs: Line 334 - "Finisehd" should be "Finished"

  2. Error type consistency: In storage_healing.rs lines 444-456, the handling of oversized node responses is good, but consider whether it should be treated as more severe peer misbehavior.

Positive Changes

  • ✅ Proper error handling instead of panics in GuestProgramState
  • ✅ Good error propagation in peer_handler.rs
  • ✅ Appropriate logging for peer misbehavior in storage_healing.rs
  • ✅ Proper error handling for genesis file deserialization

The PR is moving in the right direction by eliminating panics, but needs to address the thread spawn error handling to be fully correct.


Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions

github-actions bot commented Feb 6, 2026

Lines of code report

Total lines added: 25
Total lines removed: 12
Total lines changed: 37

Detailed view
+-------------------------------------------------------+-------+------+
| File                                                  | Lines | Diff |
+-------------------------------------------------------+-------+------+
| ethrex/crates/blockchain/blockchain.rs                | 1895  | +8   |
+-------------------------------------------------------+-------+------+
| ethrex/crates/blockchain/dev/block_producer.rs        | 123   | +3   |
+-------------------------------------------------------+-------+------+
| ethrex/crates/common/types/block_execution_witness.rs | 406   | -12  |
+-------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/healing/state.rs    | 389   | +2   |
+-------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/healing/storage.rs  | 627   | +12  |
+-------------------------------------------------------+-------+------+

@greptile-apps

greptile-apps bot commented Feb 6, 2026

Greptile Overview

Greptile Summary

  • Replaced several expect/panic! sites across block execution, dev block production, networking, sync healing, and genesis loading with structured error propagation.
  • Updated P2P network startup to return a typed error when UDP bind fails, instead of crashing.
  • Hardened state/storage healing paths against internal panics by returning SyncError/PeerHandlerError and scoring/requeueing misbehaving peers.
  • Adjusted zkVM witness state trie update logic to propagate trie/decoding errors instead of panicking, preserving existing update semantics.

Confidence Score: 4/5

  • This PR is likely safe to merge and primarily improves robustness by removing panics, with limited behavioral risk in error paths.
  • Most changes are mechanical panic-to-Result conversions. The main risk lies in the sync healing paths, where previously-unreachable invariants now return errors (possibly aborting healing), and in the altered handling of oversized peer responses; these should be validated with tests/sync scenarios.
  • The files most affected by this risk are crates/networking/p2p/sync/state_healing.rs and crates/networking/p2p/peer_handler.rs.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/blockchain/blockchain.rs | Replaced thread spawn panics with error propagation in the block execution pipeline; behavior looks correct, but thread panic join paths still convert to custom errors. |
| crates/blockchain/dev/block_producer.rs | Replaced payload_id unwrap panic with retry+log; control flow remains consistent with existing retry handling. |
| crates/common/types/block_execution_witness.rs | Converted trie insert/get/remove and decode panics into proper error propagation via GuestProgramStateError; functional behavior preserved. |
| crates/networking/p2p/network.rs | Replaced UDP bind expect with ? and added a NetworkError variant to propagate bind failures cleanly. |
| crates/networking/p2p/peer_handler.rs | Replaced state trie access expects and a panic in storage range handling with PeerHandlerError returns; potential for a new hard error path if invariants break. |
| crates/networking/p2p/sync/state_healing.rs | Replaced multiple panics in healing DB writes and membatch parent lookup with error propagation; new error path may abort healing on malformed/missing parent paths. |
| crates/networking/p2p/sync/storage_healing.rs | Replaced panic on oversized peer response with warn+failure scoring+requeue; avoids crashing but changes how bad peers are handled. |
| crates/storage/store.rs | Replaced genesis JSON deserialization expect with StoreError propagation, preventing a crash on malformed genesis files. |

Sequence Diagram

sequenceDiagram
    participant BC as Blockchain::execute_block_pipeline
    participant Warm as warmer thread
    participant Exec as execution thread
    participant Merk as merkleizer thread
    participant Store as Store/TrieDB

    BC->>Warm: spawn_scoped(warm_block)
    BC->>Exec: spawn_scoped(execute_block_pipeline)
    Exec-->>BC: join() -> Result<(exec_result, t_exec_end), ChainError>
    BC->>Merk: spawn_scoped(handle_merkleization)
    Merk-->>BC: join() -> Result<(updates_list, acc_updates, t_merk_end), StoreError>

    Exec->>Store: read/write via caching store
    Merk->>Store: apply merkleization updates

    BC-->>BC: validate_gas/receipts/requests
    BC-->>BC: propagate spawn/join errors via ChainError/StoreError

@github-actions

github-actions bot commented Feb 6, 2026

🤖 Codex Code Review

Looks good overall. No correctness or consensus‑critical issues spotted in this diff. A few minor suggestions for robustness/maintainability:

  • crates/networking/p2p/sync/storage_healing.rs:443-456: When a peer returns more nodes than requested, you requeue and record failure. Consider escalating this to a stronger peer penalty (e.g., temporary ban/disconnect) since it’s a protocol violation and could be used for resource abuse.
  • crates/networking/p2p/sync/state_healing.rs:321-337: You now propagate JoinSet task errors; consider logging the JoinError/SyncError before returning, to aid in diagnosing background DB write failures (see the sketch below).
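
For that second suggestion, a sketch of logging the join/task error before returning, assuming a tokio JoinSet whose tasks return Result; the error types and log messages are illustrative stand-ins, not ethrex's real code:

use tokio::task::JoinSet;
use tracing::error;

async fn drain_with_logging(
    mut db_joinset: JoinSet<Result<(), std::io::Error>>,
) -> Result<(), Box<dyn std::error::Error>> {
    while let Some(joined) = db_joinset.join_next().await {
        match joined {
            Ok(Ok(())) => {}
            Ok(Err(write_err)) => {
                // Log the task's own failure before propagating it.
                error!(error = %write_err, "background DB write failed");
                return Err(write_err.into());
            }
            Err(join_err) => {
                // Log the JoinError (panic/cancellation) before propagating it.
                error!(error = %join_err, "background DB write task panicked or was cancelled");
                return Err(join_err.into());
            }
        }
    }
    Ok(())
}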

If you want, I can also skim the full file context for any subtle EVM/state‑trie interactions.


Automated review by OpenAI Codex · custom prompt

@greptile-apps greptile-apps bot left a comment

8 files reviewed, 3 comments

Comment on lines 443 to +456 (crates/networking/p2p/sync/storage_healing.rs)
 if request.requests.len() < nodes_size {
-    panic!("The node responded with more data than us!");
+    warn!(
+        peer = ?request.peer_id,
+        requested = request.requests.len(),
+        received = nodes_size,
+        "Peer responded with more trie nodes than requested"
+    );
+    *failed_downloads += 1;
+    peer_handler
+        .peer_table
+        .record_failure(&request.peer_id)
+        .await?;
+    download_queue.extend(request.requests);
+    return Ok(None);

Drops extra peer data

In zip_requeue_node_responses_score_peer, when nodes_size > request.requests.len() you treat it as a peer failure and requeue the original requests, but you silently drop the extra trie nodes that were already received (trie_nodes.nodes), even though they may contain useful nodes for other pending requests. If this situation can occur due to a peer bug/misbehavior (or message framing issues), this code will now reliably discard those bytes and redo requests, potentially causing repeated download loops.

If you want to ignore the response, consider explicitly documenting that extra nodes are discarded by design (and ensure the peer is disconnected/penalized enough to prevent repeated churn), or alternatively keep the first request.requests.len() nodes and process those while penalizing the peer for the overflow.

Contributor Author

The behavior goes against spec; a node exhibiting this behavior is malfunctioning.

The peer is in fact penalized.

Comment on lines 397 to 399 (crates/networking/p2p/sync/state_healing.rs)
let mut membatch_entry = membatch.remove(parent_path).ok_or(SyncError::CorruptPath)?;

membatch_entry.children_not_in_storage_count -= 1;

New hard error path

commit_node now returns Err(SyncError::CorruptPath) if membatch.remove(parent_path) is missing. Previously this was a panic!, but now the error will propagate and abort heal_state_trie (via heal_state_batch), potentially stopping healing for the entire cycle. If membatch can be cleared/partially persisted between iterations (e.g., after nodes_to_write flushes) while child nodes are still being committed, this can become a reachable failure mode and regress healing progress from “crash loud” to “silently give up healing”. Consider confirming (or asserting) the invariant that every committed child’s parent_path must always be present in membatch at that point, or handle the missing-parent case by re-queuing/recovering rather than returning a fatal error.

Contributor Author

@iovoid iovoid Feb 6, 2026

The invariant is correct.

@greptile-apps

greptile-apps bot commented Feb 6, 2026

Additional Comments (1)

crates/networking/p2p/peer_handler.rs, lines 1329-1339
Behavior change to unrecoverable

In the storage-range processing branch, the previous panic!("Should have found the account hash") is now an UnrecoverableError return. Because this path sits inside the request loop, hitting this condition will now terminate request_storage_ranges entirely (rather than crash), which may leave sync stuck without an obvious retry path. If this condition is realistically reachable due to inconsistent accounts_by_root_hash vs accounts_with_storage_root contents (e.g., data races between healing and range requests, or partial updates), it should likely be treated as a recoverable error that requeues the task instead of bailing out of storage-range downloading.

(Also appears in the earlier state-trie lookup conversion in this function: missing account now yields UnrecoverableError, which will similarly abort the whole request.)

Comment on lines 58 to 68
let payload_id = match fork_choice_response.payload_id {
    Some(id) => id,
    None => {
        tracing::error!(
            "Failed to produce block: payload_id is None in ForkChoiceResponse"
        );
        sleep(Duration::from_millis(300)).await;
        tries += 1;
        continue;
    }
};
Collaborator

A let Some(...) = ... else pattern would be cleaner here.
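
For reference, the let-else form of the quoted snippet (same statements reshaped, shown as a fragment of the surrounding retry loop):

let Some(payload_id) = fork_choice_response.payload_id else {
    tracing::error!("Failed to produce block: payload_id is None in ForkChoiceResponse");
    sleep(Duration::from_millis(300)).await;
    tries += 1;
    continue;
};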

Contributor Author

Improved in 5511f8c

#[error("Failed to start Tx Broadcaster: {0}")]
TxBroadcasterError(#[from] TxBroadcasterError),
#[error("Failed to bind UDP socket: {0}")]
UdpSocketError(#[from] std::io::Error),
Contributor

nit: Adding a blanket From<std::io::Error> to NetworkError means any io::Error from any source in a function returning Result<_, NetworkError> will silently become UdpSocketError. Currently start_network only has the one UDP bind site, but this could be misleading if the function grows. A scoped .map_err() at the call site would be more precise:

let udp_socket = UdpSocket::bind(context.local_node.udp_addr())
    .await
    .map_err(|e| NetworkError::UdpSocketError(e))?;

and drop the #[from] on the variant.
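
For illustration, the variant without #[from] (so only the explicit map_err call site performs the conversion) would look like:

#[error("Failed to bind UDP socket: {0}")]
UdpSocketError(std::io::Error),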

});
let mut healing_queue_entry = healing_queue
    .remove(parent_path)
    .ok_or(SyncError::CorruptPath)?;
Contributor

SyncError::CorruptPath is semantically wrong here — that variant was introduced for filesystem path failures (create_dir_all, DirEntry), not for a missing parent in the healing queue. More importantly, the old panic message included parent_path and path which are very useful for debugging sync issues; this replacement loses that context entirely.

Consider a dedicated variant like SyncError::HealingQueueInconsistency(String) or at least something that carries the parent/child path info, e.g.:

let mut healing_queue_entry = healing_queue.remove(parent_path).ok_or_else(|| {
    SyncError::Custom(format!(
        "Parent not found in healing queue. Parent: {parent_path:?}, path: {path:?}"
    ))
})?;

(assuming a Custom(String) variant exists or is added).
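
A hypothetical shape for such a dedicated variant, following the thiserror-style definitions quoted earlier in this conversation; the name and message are suggestions, not existing code:

#[error("Healing queue inconsistency: {0}")]
HealingQueueInconsistency(String),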

@github-project-automation github-project-automation bot moved this to In Progress in ethrex_l1 Feb 11, 2026

Labels

L1 Ethereum client

Projects

Status: In Progress

4 participants