chore(l1): replace unjustified panics with proper error propagation #6147
base: main
Conversation
🤖 Kimi Code Review

**Review Summary**

This PR improves error handling across several components by replacing panics with proper error propagation. The changes are generally positive, but there are some issues to address.

**Critical Issues**
**Medium Issues**
**Minor Issues**
**Positive Changes**

The PR is moving in the right direction by eliminating panics, but needs to address the thread spawn error handling to be fully correct.

Automated review by Kimi (Moonshot AI) · custom prompt
Greptile Overview

**Greptile Summary**

Confidence Score: 4/5
| Filename | Overview |
|---|---|
| crates/blockchain/blockchain.rs | Replaced thread spawn panics with error propagation in block execution pipeline; behavior looks correct, but thread panic join paths still convert to custom errors. |
| crates/blockchain/dev/block_producer.rs | Replaced payload_id unwrap panic with retry+log; control flow remains consistent with existing retry handling. |
| crates/common/types/block_execution_witness.rs | Converted trie insert/get/remove and decode panics into proper error propagation via GuestProgramStateError; functional behavior preserved. |
| crates/networking/p2p/network.rs | Replaced UDP bind expect with ? and added NetworkError variant to propagate bind failures cleanly. |
| crates/networking/p2p/peer_handler.rs | Replaced state trie access expects and a panic in storage range handling with PeerHandlerError returns; potential for new hard error path if invariants break. |
| crates/networking/p2p/sync/state_healing.rs | Replaced multiple panics in healing DB writes and membatch parent lookup with error propagation; new error path may abort healing on malformed/missing parent paths. |
| crates/networking/p2p/sync/storage_healing.rs | Replaced panic on oversized peer response with warn+failure scoring+requeue; avoids crashing but changes how bad peers are handled. |
| crates/storage/store.rs | Replaced genesis JSON deserialization expect with StoreError propagation, preventing crash on malformed genesis files. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant BC as Blockchain::execute_block_pipeline
    participant Warm as warmer thread
    participant Exec as execution thread
    participant Merk as merkleizer thread
    participant Store as Store/TrieDB
    BC->>Warm: spawn_scoped(warm_block)
    BC->>Exec: spawn_scoped(execute_block_pipeline)
    Exec-->>BC: join() -> Result<(exec_result, t_exec_end), ChainError>
    BC->>Merk: spawn_scoped(handle_merkleization)
    Merk-->>BC: join() -> Result<(updates_list, acc_updates, t_merk_end), StoreError>
    Exec->>Store: read/write via caching store
    Merk->>Store: apply merkleization updates
    BC-->>BC: validate_gas/receipts/requests
    BC-->>BC: propagate spawn/join errors via ChainError/StoreError
```
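As an illustration of the join-error handling the diagram refers to, here is a minimal, self-contained sketch, not ethrex code: `ChainError`, `run_pipeline`, and the gas arithmetic are stand-ins. It shows a scoped thread whose panic or failure is converted into an error and bubbled up with `?` instead of being unwrapped:

```rust
use std::thread;

// Hypothetical error type standing in for ethrex's ChainError; only for this sketch.
#[derive(Debug)]
enum ChainError {
    ExecutionPanicked,
    Execution(String),
}

// Sketch of the pattern: join() failures become errors instead of panics.
fn run_pipeline(block_number: u64) -> Result<u64, ChainError> {
    thread::scope(|s| {
        // Stand-in for the execution thread; the real pipeline executes the block here.
        let exec = s.spawn(move || -> Result<u64, ChainError> {
            if block_number == 0 {
                return Err(ChainError::Execution("cannot execute genesis".into()));
            }
            Ok(block_number * 21_000) // pretend gas used
        });

        // join() returns Err only if the thread panicked; map that to a ChainError
        // instead of calling .unwrap()/.expect(), then propagate the inner result too.
        let exec_result = exec.join().map_err(|_| ChainError::ExecutionPanicked)??;
        Ok(exec_result)
    })
}

fn main() {
    match run_pipeline(3) {
        Ok(gas) => println!("pipeline finished, gas used: {gas}"),
        Err(e) => eprintln!("pipeline failed: {e:?}"),
    }
}
```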
🤖 Codex Code Review

Looks good overall. No correctness or consensus-critical issues spotted in this diff. A few minor suggestions for robustness/maintainability:

If you want, I can also skim the full file context for any subtle EVM/state-trie interactions.

Automated review by OpenAI Codex · custom prompt
8 files reviewed, 3 comments
```diff
 if request.requests.len() < nodes_size {
-    panic!("The node responded with more data than us!");
+    warn!(
+        peer = ?request.peer_id,
+        requested = request.requests.len(),
+        received = nodes_size,
+        "Peer responded with more trie nodes than requested"
+    );
+    *failed_downloads += 1;
+    peer_handler
+        .peer_table
+        .record_failure(&request.peer_id)
+        .await?;
+    download_queue.extend(request.requests);
+    return Ok(None);
```
**Drops extra peer data** (crates/networking/p2p/sync/storage_healing.rs, lines 443–456)

In `zip_requeue_node_responses_score_peer`, when `nodes_size > request.requests.len()` you treat it as a peer failure and requeue the original requests, but you silently drop the *extra* trie nodes that were already received (`trie_nodes.nodes`), even though they may contain useful nodes for other pending requests. If this situation can occur due to a peer bug/misbehavior (or message framing issues), this code will now reliably discard those bytes and redo requests, potentially causing repeated download loops.

If you want to ignore the response, consider explicitly documenting that extra nodes are discarded by design (and ensure the peer is disconnected/penalized enough to prevent repeated churn), or alternatively keep the first `request.requests.len()` nodes and process those while penalizing the peer for the overflow.
The behavior goes against spec; a node exhibiting this behavior is malfunctioning.
The peer is in fact penalized.
```rust
let mut membatch_entry = membatch.remove(parent_path).ok_or(SyncError::CorruptPath)?;

membatch_entry.children_not_in_storage_count -= 1;
```
**New hard error path** (crates/networking/p2p/sync/state_healing.rs, lines 397–399)

`commit_node` now returns `Err(SyncError::CorruptPath)` if `membatch.remove(parent_path)` is missing. Previously this was a `panic!`, but now the error will propagate and abort `heal_state_trie` (via `heal_state_batch`), potentially stopping healing for the entire cycle. If `membatch` can be cleared/partially persisted between iterations (e.g., after `nodes_to_write` flushes) while child nodes are still being committed, this can become a reachable failure mode and regress healing progress from “crash loud” to “silently give up healing”. Consider confirming (or asserting) the invariant that every committed child’s `parent_path` must always be present in `membatch` at that point, or handle the missing-parent case by re-queuing/recovering rather than returning a fatal error.
The invariant is correct.
Additional Comments (1)
**Behavior change to unrecoverable** (crates/networking/p2p/peer_handler.rs, lines 1329–1339)

In the storage-range processing branch, the previous `panic!("Should have found the account hash")` is now an `UnrecoverableError` return. Because this path sits inside the request loop, hitting this condition will now terminate `request_storage_ranges` entirely (rather than crash), which may leave sync stuck without an obvious retry path. If this condition is realistically reachable due to inconsistent `accounts_by_root_hash` vs `accounts_with_storage_root` contents (e.g., data races between healing and range requests, or partial updates), it should likely be treated as a recoverable error that requeues the task instead of bailing out of storage-range downloading.

(Also appears in the earlier state-trie lookup conversion in this function: missing account now yields `UnrecoverableError`, which will similarly abort the whole request.)
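If that condition is reachable, one way to keep it recoverable is to requeue the task with a retry budget instead of returning `UnrecoverableError`. A minimal, self-contained sketch of that idea; every name here is an illustrative stand-in, not an actual `peer_handler.rs` identifier:

```rust
use std::collections::{HashMap, VecDeque};

// Illustrative stand-in for a storage-range task keyed by a storage root.
struct RangeTask {
    storage_root: u64,
}

fn process_tasks(accounts_by_root_hash: &HashMap<u64, String>, mut queue: VecDeque<RangeTask>) {
    let mut retries = 0;
    while let Some(task) = queue.pop_front() {
        // Recoverable handling: a missing account is logged and the task requeued
        // (with a retry budget), rather than aborting the whole download loop.
        let Some(account) = accounts_by_root_hash.get(&task.storage_root) else {
            retries += 1;
            if retries <= 3 {
                eprintln!("account for root {} not found, requeueing", task.storage_root);
                queue.push_back(task);
                continue;
            }
            eprintln!("giving up on root {}", task.storage_root);
            continue;
        };
        println!("processing storage range for account {account}");
    }
}

fn main() {
    let accounts = HashMap::from([(1u64, "alice".to_string())]);
    let queue = VecDeque::from([RangeTask { storage_root: 1 }, RangeTask { storage_root: 2 }]);
    process_tasks(&accounts, queue);
}
```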
```rust
let payload_id = match fork_choice_response.payload_id {
    Some(id) => id,
    None => {
        tracing::error!(
            "Failed to produce block: payload_id is None in ForkChoiceResponse"
        );
        sleep(Duration::from_millis(300)).await;
        tries += 1;
        continue;
    }
};
```
A `let Some(...) = ... else` pattern could be used here.
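For reference, a minimal sketch of that suggestion applied to the snippet above; it is a fragment meant to slot into the retry loop shown there, with the same behavior:

```rust
// Same control flow as the match: log, back off, bump the retry counter, and retry.
let Some(payload_id) = fork_choice_response.payload_id else {
    tracing::error!("Failed to produce block: payload_id is None in ForkChoiceResponse");
    sleep(Duration::from_millis(300)).await;
    tries += 1;
    continue;
};
```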
Improved in 5511f8c
# Conflicts: # crates/networking/p2p/peer_handler.rs # crates/networking/p2p/sync/healing/state.rs
# Conflicts: # crates/blockchain/blockchain.rs
| #[error("Failed to start Tx Broadcaster: {0}")] | ||
| TxBroadcasterError(#[from] TxBroadcasterError), | ||
| #[error("Failed to bind UDP socket: {0}")] | ||
| UdpSocketError(#[from] std::io::Error), |
nit: Adding a blanket `From<std::io::Error>` to `NetworkError` means any `io::Error` from any source in a function returning `Result<_, NetworkError>` will silently become `UdpSocketError`. Currently `start_network` only has the one UDP bind site, but this could be misleading if the function grows. A scoped `.map_err()` at the call site would be more precise:

```rust
let udp_socket = UdpSocket::bind(context.local_node.udp_addr())
    .await
    .map_err(|e| NetworkError::UdpSocketError(e))?;
```

and drop the `#[from]` on the variant.
```rust
});
let mut healing_queue_entry = healing_queue
    .remove(parent_path)
    .ok_or(SyncError::CorruptPath)?;
```
`SyncError::CorruptPath` is semantically wrong here: that variant was introduced for filesystem path failures (`create_dir_all`, `DirEntry`), not for a missing parent in the healing queue. More importantly, the old panic message included `parent_path` and `path`, which are very useful for debugging sync issues; this replacement loses that context entirely.

Consider a dedicated variant like `SyncError::HealingQueueInconsistency(String)`, or at least something that carries the parent/child path info, e.g.:

```rust
let mut healing_queue_entry = healing_queue.remove(parent_path).ok_or_else(|| {
    SyncError::Custom(format!(
        "Parent not found in healing queue. Parent: {parent_path:?}, path: {path:?}"
    ))
})?;
```

(assuming a `Custom(String)` variant exists or is added).
Motivation
We don't want panics in production code, since they can ungracefully crash the node.
Description
Removes some panics.
This makes progress towards goal 1.1 of the UX/DevEx roadmap
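As a closing illustration of the pattern this PR applies, here is a hypothetical sketch: `Genesis`, `StoreError`, and the variant name are stand-ins, not the real types in `crates/storage/store.rs`. It shows an `.expect()` on genesis deserialization replaced by a dedicated error variant plus `?`:

```rust
use serde::Deserialize;
use thiserror::Error;

// Hypothetical stand-ins for ethrex's genesis type and StoreError variants.
#[derive(Deserialize)]
struct Genesis {
    chain_id: u64,
}

#[derive(Debug, Error)]
enum StoreError {
    #[error("Failed to deserialize genesis file: {0}")]
    GenesisDeserialization(#[from] serde_json::Error),
}

// Before: serde_json::from_str(json).expect("malformed genesis") would crash the node.
// After: the error is propagated to the caller, which can report it and shut down cleanly.
fn load_genesis(json: &str) -> Result<Genesis, StoreError> {
    Ok(serde_json::from_str(json)?)
}

fn main() {
    match load_genesis(r#"{"chain_id": 1}"#) {
        Ok(genesis) => println!("loaded genesis for chain {}", genesis.chain_id),
        Err(e) => eprintln!("could not load genesis: {e}"),
    }
}
```

The caller then decides whether to retry, report, or shut down, instead of the process aborting inside the storage layer.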