Conversation


@avilagaston9 avilagaston9 commented Feb 9, 2026

Motivation

Add tooling to make prover benchmarking workflows reproducible and ergonomic: structured timing logs, a results collection script, load test improvements, and documentation.

Description

Bug fixes:

  • Fix stale VK file paths in deployer after guest program was moved out of prover
  • Fix off-by-one in batch_size metric (single-block batch reported as 0)

Load test improvements:

  • Use pending nonce for consecutive runs without nonce conflicts
  • Add --endless flag for continuous load generation
  • Add env var support: LOAD_TEST_RPC_URL, LOAD_TEST_TX_AMOUNT, LOAD_TEST_ENDLESS

Prover instrumentation:

  • Add proving time instrumentation (prove_timed) and log-parsing benchmark script (scripts/bench_metrics.sh)
  • Enrich benchmark script with Prometheus batch metadata (gas, txs, blocks)
  • Add --timed flag to prover CLI to control whether proving time is measured (default: disabled). Set via PROVER_CLIENT_TIMED env var.

Proof coordinator:

  • Add prover_type field to BatchRequest so the coordinator knows which backend the client runs
  • Coordinator skips assigning batches to provers whose type isn't needed (e.g. an SP1 prover won't be assigned work when only exec proofs are required)
  • Add prover_type() method to the ProverBackend trait

Documentation:

  • Add prover benchmarking guide (docs/l2/prover-benchmarking.md)
  • Add agent workflow for prover benchmarking (docs/workflows/prover_benchmarking.md)

Checklist

  • Updated STORE_SCHEMA_VERSION (crates/storage/lib.rs) if the PR includes breaking changes to the Store requiring a re-sync.

Switch the prover loop from prove() to prove_timed() so each batch logs
a structured line with batch number and elapsed proving time (seconds and
milliseconds). Add scripts/sp1_bench_metrics.sh that tails the prover log,
collects results into a CSV, and prints a summary table on exit.
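
The shape of such a timing wrapper can be sketched in a few lines. The types below are simplified stand-ins (the real `ProverBackend` API carries batch input and backend-specific output), so treat this as an illustration of the pattern only:

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for the real proof output type.
struct Proof;

fn prove() -> Result<Proof, String> {
    // Placeholder for the real proving call.
    Ok(Proof)
}

// Wrap the untimed call and return the elapsed wall-clock time alongside
// the proof, so the caller can emit a structured log line.
fn prove_timed() -> Result<(Proof, Duration), String> {
    let start = Instant::now();
    let proof = prove()?;
    Ok((proof, start.elapsed()))
}

fn main() {
    let (_proof, elapsed) = prove_timed().expect("proving failed");
    // Emit both seconds and milliseconds, as the PR's structured line does;
    // a log-parsing script can then grep these fields per batch.
    println!(
        "Proved batch batch=1 proving_time_s={} proving_time_ms={}",
        elapsed.as_secs(),
        elapsed.as_millis()
    );
}
```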
… of prover

The guest program was moved from crates/l2/prover/src/ethrex_guest_program/
to crates/guest-program/, but the fallback VK paths in the deployer were not
updated. This caused deploy-l1-sp1 to fail with "No such file or directory"
when running outside Docker. CI was unaffected because it passes explicit
VK paths via ETHREX_SP1_VERIFICATION_KEY_PATH in docker-compose.yaml.
…nce conflicts,

and add --endless flag for continuous load generation. The load_test function now
fetches the pending nonce (instead of latest) so re-runs pick up where the previous
round left off. wait_until_all_included now tracks per-account target nonces instead
of a flat tx_amount, which was incorrect for any run where the starting nonce was
not zero.
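
The per-account target-nonce idea can be illustrated with a small standalone sketch (the account keys and helpers here are hypothetical; the real code fetches nonces over RPC):

```rust
use std::collections::HashMap;

// Given each account's starting (pending) nonce, compute the nonce it must
// reach for all of this round's transactions to be included. Tracking a
// per-account target instead of a flat tx_amount stays correct when the
// starting nonce is nonzero (e.g. on a second consecutive run).
fn target_nonces(
    start_nonces: &HashMap<&'static str, u64>,
    tx_amount: u64,
) -> HashMap<&'static str, u64> {
    start_nonces
        .iter()
        .map(|(acct, start)| (*acct, start + tx_amount))
        .collect()
}

// All transactions are included once every account reaches its target.
fn all_included(
    current: &HashMap<&'static str, u64>,
    targets: &HashMap<&'static str, u64>,
) -> bool {
    targets
        .iter()
        .all(|(acct, t)| current.get(acct).copied().unwrap_or(0) >= *t)
}

fn main() {
    let starts = HashMap::from([("alice", 10), ("bob", 0)]);
    let targets = target_nonces(&starts, 5);
    assert_eq!(targets["alice"], 15); // not 5: alice's starting nonce was 10
    let current = HashMap::from([("alice", 15), ("bob", 5)]);
    assert!(all_included(&current, &targets));
}
```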
…ndpoint.

The script now fetches batch_gas_used, batch_tx_count, and batch_size from
the L2 metrics endpoint (localhost:3702/metrics) for each proved batch, joining
them with proving time into a single CSV and summary table.
…ing.

It now parses all proving_time lines from the file, fetches batch metadata
from Prometheus, prints the table, and exits immediately.
…last]

contains last - first + 1 blocks, not last - first. A single-block batch
was reported as size 0.
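
A standalone sketch of the inclusive-range arithmetic, using checked operations to guard against bad inputs instead of underflowing on u64:

```rust
// Number of blocks in the inclusive range [first, last]. Returns None when
// last < first rather than underflowing.
fn inclusive_batch_size(first: u64, last: u64) -> Option<u64> {
    last.checked_sub(first)?.checked_add(1)
}

fn main() {
    assert_eq!(inclusive_batch_size(5, 5), Some(1)); // single-block batch, not 0
    assert_eq!(inclusive_batch_size(5, 7), Some(3));
    assert_eq!(inclusive_batch_size(7, 5), None); // bad input, no underflow
}
```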
…ment.

When enabled (default), the prover wraps each prove() call with timing and logs
structured fields (proving_time_s, proving_time_ms). When disabled with --no-timed,
the prover calls prove() directly without timing overhead. Also fixes a clippy
as_conversions warning in the elapsed.as_millis() cast.
…ess mode.

The clap args now accept LOAD_TEST_RPC_URL, LOAD_TEST_TX_AMOUNT, and
LOAD_TEST_ENDLESS environment variables as alternatives to --node, --tx-amount,
and --endless flags respectively. CLI flags take precedence over env vars.
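
The precedence rule (flag over env var over default) amounts to a simple option chain; this standalone sketch mirrors what clap applies when an option carries an `env` attribute:

```rust
// Resolve a setting with CLI-flag > env-var > default precedence.
fn resolve(cli_flag: Option<&str>, env_var: Option<&str>, default: &str) -> String {
    cli_flag.or(env_var).unwrap_or(default).to_string()
}

fn main() {
    // Only the env var is set: it is used.
    assert_eq!(
        resolve(None, Some("http://localhost:1729"), "http://localhost:8545"),
        "http://localhost:1729"
    );
    // Flag present: it overrides the env var.
    assert_eq!(
        resolve(Some("http://other:8545"), Some("http://localhost:1729"), "http://localhost:8545"),
        "http://other:8545"
    );
    // Neither set: the default applies.
    assert_eq!(resolve(None, None, "http://localhost:8545"), "http://localhost:8545");
}
```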
The guide (docs/l2/prover-benchmarking.md) explains how to use the localnet,
prover timing, load test, and benchmark script together to measure proving
performance. The agent workflow (docs/workflows/prover_benchmarking.md) provides
step-by-step instructions for running a remote benchmark session, following the
same format as the existing execution witness benchmarking workflow.

github-actions bot commented Feb 9, 2026

🤖 Kimi Code Review

Review Summary

Overall, this PR introduces useful benchmarking infrastructure for the L2 prover with good observability improvements. The changes are mostly well-structured, but there are a few issues to address.

Issues Found

1. Potential Overflow in l1_committer.rs (Line 976)

let batch_size = (last_added_block_number - first_block_of_batch + 1).try_into()?;

Issue: The + 1 addition could overflow if last_added_block_number == u64::MAX. While unlikely in practice, this should use checked_add(1) or handle the edge case.

2. Path Construction in deployer.rs (Lines 1018, 1025, 1029)

format!("{}/../../crates/guest-program/bin/...", env!("CARGO_MANIFEST_DIR"))

Issue: Path construction is fragile. Consider using PathBuf and join() for cross-platform compatibility:

let base_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
    .parent()
    .unwrap()
    .parent()
    .unwrap()
    .join("crates/guest-program/bin/...");

3. Error Handling in prover.rs (Lines 108-111)

proving_time_ms = u64::try_from(elapsed.as_millis()).unwrap_or(u64::MAX)

Issue: Using u64::MAX as a fallback could lead to confusing metrics. Consider logging an error and using a more reasonable fallback like 0 or the seconds value converted to ms.

4. Missing Validation in sp1_bench_metrics.sh

Issue: The script doesn't validate that the metrics URL is reachable before attempting to fetch data. Add a connectivity check:

if ! curl -s --connect-timeout 5 "$METRICS_URL" > /dev/null 2>&1; then
    echo "Error: Cannot connect to metrics endpoint at $METRICS_URL"
    exit 1
fi

5. Race Condition in Load Test

Issue: In load_test/src/main.rs, the wait_until_all_included function uses BlockTag::Latest which might not include pending transactions. Consider using BlockTag::Pending consistently or adding a small delay between checks.

Positive Aspects

  • Good CLI ergonomics: The new --timed flag and environment variables follow consistent patterns
  • Comprehensive documentation: The benchmarking guide is well-structured and includes troubleshooting
  • Structured logging: The timing information is logged in a parseable format
  • Backward compatibility: Default values maintain existing behavior

Minor Suggestions

  1. Documentation: Add a note in prover-benchmarking.md about the potential performance overhead of --timed
  2. Metrics: Consider adding a metric for failed proof attempts in the prover
  3. Script: The benchmark script could benefit from a --help flag and version information

The changes are generally well-implemented and the benchmarking infrastructure will be valuable for performance analysis.


Automated review by Kimi (Moonshot AI) · custom prompt

@avilagaston9 avilagaston9 changed the title from "Extract prover benchmarking tooling into standalone PR" to "feat(l2): add prover benchmarking tooling and documentation" Feb 9, 2026
@github-actions github-actions bot added the L2 Rollup client label Feb 9, 2026

github-actions bot commented Feb 9, 2026

🤖 Claude Code Review

Based on my review of PR #6157, here's my comprehensive code review:


Code Review: PR #6157 - Extract Prover Benchmarking Tooling

Summary

This PR adds benchmarking infrastructure for proving performance, including timing instrumentation, load test improvements, and documentation. The changes are mostly low-risk tooling enhancements.

Critical Issues

1. Bash Script: Potential Integer Overflow (scripts/sp1_bench_metrics.sh:85-93)

count=0; total=0; min=999999999; max=0; total_gas=0; total_txs=0

Issue: Bash integer arithmetic can overflow for large millisecond values. If proving takes hours, total could exceed bash's integer limits (~2^63-1).

Fix: Add overflow protection or use bc for large numbers:

if [[ $total -gt 9223372036854775807 ]]; then
    echo "Warning: overflow detected in timing calculations"
fi

2. Bash Script: Division by Zero Risk (scripts/sp1_bench_metrics.sh:97)

avg=$((total / count))

Issue: While protected by if [[ $count -gt 0 ]], the logic flow could be clearer.

Status: Actually safe, but consider making it more explicit.


High Priority Issues

3. Load Test: Unwrap Without Error Context (tooling/load_test/src/main.rs:232-234)

let nonce = client
    .get_nonce(account.address(), BlockIdentifier::Tag(BlockTag::Pending))
    .await
    .unwrap();

Issue: .unwrap() will panic if RPC fails. In async task, this terminates only that task, but provides no useful error message.

Fix:

.await
.context("Failed to fetch pending nonce")?;

4. Load Test: Missing Error Propagation Context (tooling/load_test/src/main.rs:285-288)

let nonce = client
    .get_nonce(src, BlockIdentifier::Tag(BlockTag::Latest))
    .await
    .unwrap();

Issue: Same as above in the wait loop.


Medium Priority Issues

5. Prover: Potential U64 Overflow in Milliseconds (crates/l2/prover/src/prover.rs:110)

proving_time_ms = u64::try_from(elapsed.as_millis()).unwrap_or(u64::MAX),

Issue: Using unwrap_or(u64::MAX) masks overflow. For proving times >584 million years this saturates to u64::MAX, which could confuse metrics aggregation.

Recommendation: This is practically fine (proving won't take that long), but consider logging a warning on saturation for correctness:

proving_time_ms = u64::try_from(elapsed.as_millis())
    .inspect_err(|_| warn!("Proving time exceeds u64::MAX ms, clamping"))
    .unwrap_or(u64::MAX),

6. Bash Script: Inefficient Metric Fetching (scripts/sp1_bench_metrics.sh:30-37)

fetch_metric() {
    local metric="$1" batch="$2"
    curl -s "$METRICS_URL" 2>/dev/null \
        | grep "^${metric}{" \
        | grep "batch_number=\"${batch}\"" \
        | awk '{print $2}' \
        | head -1
}

Issue: This fetches the entire metrics endpoint for EACH metric for EACH batch (3 metrics per batch). For 100 batches, this makes 300 HTTP requests.

Fix: Fetch metrics once and cache:

METRICS_CACHE=$(curl -s "$METRICS_URL" 2>/dev/null)
fetch_metric() {
    echo "$METRICS_CACHE" | grep "^${1}{" | grep "batch_number=\"${2}\"" | awk '{print $2}' | head -1
}

7. Load Test: Nonce Race Condition (tooling/load_test/src/main.rs:232)

let nonce = client
    .get_nonce(account.address(), BlockIdentifier::Tag(BlockTag::Pending))
    .await
    .unwrap();

Issue: Using BlockTag::Pending is good, but if multiple load test instances run concurrently for the same account, they could get the same nonce range and collide.

Status: Acceptable for single-instance testing. Document this limitation in the guide or add a warning.


Low Priority / Style Issues

8. Load Test: Clone in Tight Loop (tooling/load_test/src/main.rs:240-260)

for i in 0..tx_amount {
    let (value, calldata, dst) = tx_builder.build_tx();
    // ...
    sleep(Duration::from_micros(800)).await;
}

Issue: tx_builder.build_tx() is called in a loop. If tx_builder involves cloning, this could be optimized.

Status: Likely negligible impact. The sleep dominates timing.

9. Prover: Duplicate Log Messages (crates/l2/prover/src/prover.rs:111-124)

"Proved batch {} in {:.2?}",
prover_data.batch_number,
elapsed
// vs
"Proved batch {}", prover_data.batch_number

Suggestion: The two branches have similar messages. Consider extracting common logging:

let batch_proof = if self.timed {
    let (output, elapsed) = self.backend.prove_timed(...)?;
    info!(batch = ..., proving_time_s = ..., proving_time_ms = ..., "Proved batch {} in {:.2?}", ...);
    output
} else {
    let output = self.backend.prove(...)?;
    info!(batch = ..., "Proved batch {}", ...);
    output
};
self.backend.to_batch_proof(output, prover_data.format)

10. Documentation: Missing Batch Size Explanation (docs/l2/prover-benchmarking.md:127)

| `batch_size{batch_number="N"}` | Number of blocks in batch N |

Clarification: Given the fix in l1_committer.rs:976, confirm this metric now correctly represents block count (inclusive).


Positive Observations

Off-by-one fix is correct (crates/l2/sequencer/l1_committer.rs:976)

  • batch_size = (last_added_block_number - first_block_of_batch + 1) correctly includes both endpoints.

VK path updates are correct (cmd/ethrex/l2/deployer.rs:1018-1028)

  • Paths properly reflect the moved guest program location.

Timed flag design is clean

  • Defaulting to true is sensible for a benchmarking tool.
  • Env var support (PROVER_CLIENT_TIMED) is good for CI/scripts.

Load test improvements are valuable

  • Using BlockTag::Pending for nonces fixes consecutive run issues.
  • --endless flag is useful for long-running benchmarks.
  • Env var support improves scriptability.

Documentation is thorough

  • Both the guide and workflow docs are clear and actionable.

ProverBackend trait usage is correct

  • The prove_timed method has a default implementation in the trait, so all backends automatically support it.

Security Considerations

No critical security issues found. This is primarily tooling/benchmarking code that:

  • Doesn't touch consensus logic
  • Doesn't modify transaction/block validation
  • Doesn't affect state trie or storage
  • Doesn't change cryptographic operations

Performance Considerations

  • The bash script's N×3 HTTP requests could be slow for large batch counts (see #6).
  • Load test nonce fetching is appropriately using Pending to avoid gaps.
  • Timing instrumentation overhead is minimal (single Instant::now() call).

Recommendations

  1. Must fix: Add error context to load test unwraps (#3, #4)
  2. Should fix: Optimize bash script metric fetching (#6)
  3. Consider: Add overflow warning to proving_time_ms (#5)
  4. Consider: Document nonce race condition limitation (#7)
  5. Nice to have: Refactor duplicate logging in prover (#9)

Verdict

Approved with minor suggestions. The code is well-structured, the documentation is excellent, and the changes are low-risk. The main concerns are error handling in the load test (easy fix) and bash script efficiency (minor optimization).


Automated review by Claude (Anthropic) · custom prompt


github-actions bot commented Feb 9, 2026

🤖 Codex Code Review

Findings (ordered by severity)

  1. Benchmark log can record “Proved batch” even when to_batch_proof fails
  • In both timed and non‑timed branches, the log line is emitted before to_batch_proof, so a conversion error will still produce a success log that the new benchmark script parses. This can inflate results and mislead operators. Consider logging only after to_batch_proof succeeds (or include a status flag).
  • File: crates/l2/prover/src/prover.rs:102
  2. Potential underflow in inclusive batch size calculation
  • batch_size is now (last_added_block_number - first_block_of_batch + 1). If last_added_block_number < first_block_of_batch (bad input or off‑by‑one earlier), this underflows on u64 and could emit a huge metric. Suggest checked_sub with a clear error or guard before adding 1.
  • File: crates/l2/sequencer/l1_committer.rs:976

Other notes

  • The --timed option and log enrichment look fine, but consider whether the timing log should be emitted only on successful proof/submit, especially since the new script uses it as ground truth.
  • The LOAD_TEST_* env vars and pending‑nonce target tracking look good; behavior changes are reasonable.

If you want, I can propose concrete patches for the two items above.


Automated review by OpenAI Codex · custom prompt


github-actions bot commented Feb 9, 2026

Lines of code report

Total lines added: 67
Total lines removed: 2
Total lines changed: 69

Detailed view
+-------------------------------------------------+-------+------+
| File                                            | Lines | Diff |
+-------------------------------------------------+-------+------+
| ethrex/cmd/ethrex/l2/options.rs                 | 1080  | +10  |
+-------------------------------------------------+-------+------+
| ethrex/crates/l2/common/src/prover.rs           | 197   | +7   |
+-------------------------------------------------+-------+------+
| ethrex/crates/l2/prover/src/backend/exec.rs     | 67    | +3   |
+-------------------------------------------------+-------+------+
| ethrex/crates/l2/prover/src/backend/mod.rs      | 96    | +1   |
+-------------------------------------------------+-------+------+
| ethrex/crates/l2/prover/src/backend/openvm.rs   | 103   | +3   |
+-------------------------------------------------+-------+------+
| ethrex/crates/l2/prover/src/backend/risc0.rs    | 131   | +3   |
+-------------------------------------------------+-------+------+
| ethrex/crates/l2/prover/src/backend/sp1.rs      | 182   | +3   |
+-------------------------------------------------+-------+------+
| ethrex/crates/l2/prover/src/backend/zisk.rs     | 158   | +3   |
+-------------------------------------------------+-------+------+
| ethrex/crates/l2/prover/src/config.rs           | 12    | +1   |
+-------------------------------------------------+-------+------+
| ethrex/crates/l2/prover/src/prover.rs           | 217   | +32  |
+-------------------------------------------------+-------+------+
| ethrex/crates/l2/sequencer/proof_coordinator.rs | 424   | -2   |
+-------------------------------------------------+-------+------+
| ethrex/crates/l2/tee/quote-gen/src/sender.rs    | 86    | +1   |
+-------------------------------------------------+-------+------+

or in a loop depending on --endless, instead of a loop-with-break pattern.
…lient,

tx_builder, and accounts instead of owned values, removing unnecessary clones
at the call sites in run_round.
This removes the need for --no-timed; users pass --timed when they want
proving time measurement, and omit it otherwise.
both the default and gpu-enabled cargo run commands.
--timed can be passed via make (e.g. make init-prover-sp1 PROVER_ARGS=--timed).
Update docs to use the Makefile commands with GPU=true and PROVER_ARGS examples.
PROVER_ARGS, so timing is enabled with TIMED=true (e.g. make init-prover-sp1
GPU=true TIMED=true).
…alias.

Clap already reads the env var, so no Makefile plumbing is needed. Reverts
the TIMED/PROVER_ARGS additions to the Makefile targets.
…ummary

instead of CSV. Add load-test Makefile target with env var configuration.
Update docs to use Makefile commands throughout, add mempool limit warning,
note that L2 must be running when collecting results, and reference
PROVER_CLIENT_TIMED env var instead of --timed flag.
root Makefile target instead. Update docs to run make load-test from
the repo root.
script. The report now includes a Server Specs section with hardware
info detected from /proc/cpuinfo, /proc/meminfo, sysctl, and nvidia-smi.
by default) in the prover benchmarking workflow.
batch assignments by whether the prover's type is needed. Previously
the coordinator assigned a batch whenever ANY required proof was missing,
causing provers to waste time re-proving batches they had already proved
(e.g. an SP1 prover proving a batch that only needed an exec proof).

The ProverBackend trait now has a prover_type() method, implemented for
each backend (todo!() for ZisK and OpenVM which lack ProverType variants).
The prover client stores its type at startup and sends it with every
BatchRequest. The coordinator checks needed_proof_types.contains() before
assigning.
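
A minimal sketch of that flow, with hypothetical simplified types (the real trait and coordinator carry much more context):

```rust
// Hypothetical, simplified types; the actual ProverBackend trait and
// coordinator logic in ethrex look different.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ProverType { Exec, SP1 }

trait ProverBackend {
    fn prover_type(&self) -> ProverType;
}

struct Sp1Backend;
impl ProverBackend for Sp1Backend {
    fn prover_type(&self) -> ProverType { ProverType::SP1 }
}

// The coordinator only assigns work when the requesting prover's type is
// among the proof types the batch still needs.
fn should_assign(needed_proof_types: &[ProverType], prover_type: ProverType) -> bool {
    needed_proof_types.contains(&prover_type)
}

fn main() {
    let backend = Sp1Backend;
    // Only an exec proof is required: the SP1 prover gets no work.
    assert!(!should_assign(&[ProverType::Exec], backend.prover_type()));
    // SP1 proof still missing: the SP1 prover is assigned the batch.
    assert!(should_assign(&[ProverType::Exec, ProverType::SP1], backend.prover_type()));
}
```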

Also fixes the deploy-l1-sp1 Makefile target which used the removed
--sp1.deploy-verifier flag — replaced with --sp1 and removed dummy
verifier addresses for other backends. Updated benchmarking docs to
reference deploy-l1-sp1 for SP1 benchmarks.
already exposes it via the prover_type() trait method.
avilagaston9 added a commit that referenced this pull request Feb 9, 2026
… todo!()

for ZisK and OpenVM backends (no ProverType variant exists for them yet) and
call self.backend.prover_type() directly instead of storing it as a field.
…oad_test formatting.

Commit 517e535 added prover_type to BatchRequest but didn't update the TDX
quote-gen sender (outside the workspace), breaking the tdx lint/build CI jobs.
The load_test formatting was also off since it's outside the main workspace.
…arted

immediately after the L2 is up, not just before the prover. The prover proves
batches faster than the sequencer fills them with transactions, so any delay
in starting the load test results in empty batches being proved.
empty batches. The block producer and committer build and commit
empty blocks/batches faster than the load test can start sending
transactions (especially for erc20/fibonacci/io-heavy which have
a contract deployment setup phase). By pre-compiling the binary
in step 1b and running it directly in step 3, the load test starts
immediately when the L2 is up.
…t recompilation.

Previously each step (deploy, L2 start, prover start) triggered its own `cargo run`
which recompiled the binary. Now step 1b builds everything once with all needed features,
and all subsequent steps use `target/release/ethrex` directly. Also made the docs
backend-agnostic (not SP1-specific) with placeholder flags and tables for SP1/RISC0/Exec,
and made the branch/commit prompt mandatory (no default).
avilagaston9 added a commit that referenced this pull request Feb 10, 2026
ProofCoordinator: the field is used again in PR #6157 and is preserved
to simplify merge conflicts.
…name

sp1_bench_metrics.sh to bench_metrics.sh since the script is backend-agnostic,
fix the load test targets table in the workflow doc to show -t flag values
instead of Makefile targets, and add a note clarifying that the hardcoded
keys in the deploy command are default test keys from fixtures/.
fix the load test progress percentage to be relative to the current round
(not the absolute nonce), make run_round return Result so endless mode
survives individual round failures, remove a duplicate debug log in the
proof coordinator, add bash 4+ note to bench_metrics.sh, and clarify that
LOAD_TEST_RPC_URL must point to the L2 port for benchmarking.
backend that is not in the deployment's required proof types, instead of
silently sending an empty batch response. The prover now logs an error
and surfaces the rejection clearly, rather than endlessly polling for
batches it will never be assigned.
…ndant:

we already check whether this specific prover type's proof exists for the
batch, and we early-return if the prover type is not needed at all. If
the current prover's proof exists, the batch is skipped regardless of
what other proof types are missing.
if/else block. Each condition (proof already exists, batch not found, no
input for version) is checked and returned independently.
get_prover_input_by_batch_and_version already returns None when the
batch does not exist. Also send an empty batch response instead of
NoBatchForVersion when no input is found, since the prover just needs
to know there is nothing to prove right now.
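
The flattened control flow described above can be sketched with early returns; the types here are hypothetical simplifications of the coordinator's real response handling:

```rust
// Hypothetical response type: the coordinator either has nothing for this
// prover right now, or assigns it a batch.
#[derive(Debug, PartialEq)]
enum Response {
    Empty,
    Assigned(u64),
}

// Each condition early-returns on its own instead of nesting if/else.
fn handle_request(proof_exists: bool, input: Option<u64>) -> Response {
    if proof_exists {
        // This prover's proof for the batch is already done.
        return Response::Empty;
    }
    // A missing batch and a missing input collapse into the same case:
    // there is simply nothing to prove right now.
    let Some(batch) = input else {
        return Response::Empty;
    };
    Response::Assigned(batch)
}

fn main() {
    assert_eq!(handle_request(true, Some(7)), Response::Empty);
    assert_eq!(handle_request(false, None), Response::Empty);
    assert_eq!(handle_request(false, Some(7)), Response::Assigned(7));
}
```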
that unneeded prover types are rejected immediately without hitting the
database.
@avilagaston9 avilagaston9 marked this pull request as ready for review February 11, 2026 13:43
Copilot AI review requested due to automatic review settings February 11, 2026 13:43
@avilagaston9 avilagaston9 requested review from a team and ManuelBilbao as code owners February 11, 2026 13:43
@github-actions

🤖 Kimi Code Review

Review Summary

This PR introduces prover benchmarking capabilities and fixes several issues in the L2 prover system. Here are the key findings:

Good Changes

  1. Prover benchmarking infrastructure - Well-structured addition with proper documentation
  2. Path fixes - Corrects verification key paths in deployer.rs (lines 1018, 1023, 1027)
  3. Batch size calculation fix - Corrects an off-by-one error in l1_committer.rs:976
  4. Prover type filtering - Smart optimization to skip unnecessary proof types

⚠️ Issues Found

1. Unimplemented Functions (Critical)

  • OpenVmBackend::prover_type() and ZiskBackend::prover_type() return unimplemented!()
  • Location: crates/l2/prover/src/backend/openvm.rs:68, crates/l2/prover/src/backend/zisk.rs:117
  • Fix: Return appropriate ProverType variants instead of panicking

2. Potential Overflow (Medium)

  • u64::try_from(elapsed.as_millis()).unwrap_or(u64::MAX) in prover.rs:110
  • Location: crates/l2/prover/src/prover.rs:110
  • Fix: Use .unwrap_or_else(|_| u64::MAX) for better error handling

3. Missing Error Handling (Medium)

  • tee/quote-gen/src/sender.rs:16 - Missing prover_type parameter in BatchRequest
  • Fix: Add prover_type: ProverType::TDX to match new signature

4. Documentation Inconsistency (Low)

  • docs/workflows/prover_benchmarking.md uses hardcoded private keys
  • Recommendation: Add warning about test-only keys

🔧 Suggested Improvements

  1. Add validation for prover_type parameter in ProofData::batch_request
  2. Consider using Duration::as_secs_f64() for more precise timing in logs
  3. Add metrics for rejected prover types in proof_coordinator.rs

🎯 Security Considerations

  • No critical security issues found
  • Private keys in docs are test-only and properly documented

📊 Performance Impact

  • Minimal overhead from new timing measurements
  • Prover type filtering reduces unnecessary work

The PR is well-structured and the benchmarking additions are valuable. The unimplemented functions need immediate attention before merging.


Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions

🤖 Codex Code Review

Findings (ordered by severity)

  • Potential prover starvation/backlog: ProofCoordinator::handle_batch_request now gets the “next batch to prove” without considering the prover’s type, then returns an empty response if that type’s proof already exists. If next_batch_to_prove_for_version returns the earliest batch missing any proof type, a prover whose type is already done for that batch will keep getting empty responses and won’t progress to later batches that still need its proof. This can idle a prover indefinitely while other proof types catch up. Consider selecting the next batch that is missing this prover_type instead of returning empty.
    Path: crates/l2/sequencer/proof_coordinator.rs (around the new “proof already exists” check).

  • Runtime panic for enabled backends: ProverBackend::prover_type() is unimplemented!() for OpenVM and ZisK. If those backends are selectable via config/CLI (or enabled by feature), the prover will panic the first time it requests a batch. Prefer returning the correct ProverType or hard‑disable these backends at configuration/CLI level with a clear error.
    Paths: crates/l2/prover/src/backend/openvm.rs (around line 64), crates/l2/prover/src/backend/zisk.rs (around line 113).

  • Protocol compatibility risk: ProofData::BatchRequest gained a new field (prover_type) and a new variant (ProverTypeNotNeeded). If this enum is serialized over the network (likely), older coordinators/provers will fail to decode and disconnect. Ensure the protocol version is bumped or add backward‑compatible decoding/handshake.
    Path: crates/l2/common/src/prover.rs (around lines 179–210).

Notes / minor observations

  • The change in batch size calculation (+1) looks like an off‑by‑one fix and should be correct if first_block_of_batch and last_added_block_number are inclusive.
    Path: crates/l2/sequencer/l1_committer.rs (around line 973).

  • Logging and timing in the prover look fine; PROVER_CLIENT_TIMED is cleanly plumbed.
    Paths: cmd/ethrex/l2/options.rs, crates/l2/prover/src/prover.rs.

If you want, I can dig into next_batch_to_prove_for_version to confirm the starvation scenario and suggest a concrete fix.


Automated review by OpenAI Codex · custom prompt

@github-actions

🤖 Claude Code Review


PR Review: feat(l2): add prover benchmarking tooling and documentation

Overview

This PR adds comprehensive tooling for reproducible prover benchmarking, including timing instrumentation, metrics collection, load testing improvements, and documentation. The changes are well-structured and the feature additions are valuable for performance measurement.

Critical Issues

1. Proof Coordinator Logic Change - Potential Race Condition Risk ⚠️

Location: crates/l2/sequencer/proof_coordinator.rs:167-176

The coordinator now checks prover_type before calling next_batch_to_prove_for_version(). This is good for early rejection, but the order of operations has changed significantly:

Before: The coordinator checked if ALL needed proof types existed for a batch before assigning work.
After: It only checks if THIS prover type's proof exists.

Issue: While the commit message says "Move the prover type check before the store query," this optimization comes at line 167-176, which is good. However, the removal of the "all proofs exist" check (old lines 385-396) changes the fundamental coordination logic. Multiple provers of different types can now work on the same batch simultaneously, which might be intentional but should be verified.

Recommendation: Confirm this behavior change is intentional. If so, document it in the commit message or comments.


2. Error Handling - Prover Terminates on ProverTypeNotNeeded 🔴

Location: crates/l2/prover/src/prover.rs:329-337

ProofData::ProverTypeNotNeeded { prover_type } => {
    error!(...);
    return Err(format!(...)); // This causes the request to fail
}

When a prover receives ProverTypeNotNeeded, it returns an Err, which propagates up and causes the prover to continue the loop but skip that coordinator endpoint. However, this error is logged at error! level when it's actually an expected operational state (wrong backend type for this deployment).

Issues:

  • The error message is misleading - this isn't a failure, it's a configuration mismatch
  • The prover will keep retrying this coordinator in every loop iteration, generating spam
  • Should either: (a) remove that coordinator from the list permanently, or (b) downgrade to warn! and add exponential backoff

Recommendation: Either remove incompatible coordinators from the list after first rejection, or handle this as a non-error condition with backoff.


3. Off-by-One Fix Verification Needed ⚠️

Location: crates/l2/sequencer/l1_committer.rs:350

-let batch_size = (last_added_block_number - first_block_of_batch).try_into()?;
+let batch_size = (last_added_block_number - first_block_of_batch + 1).try_into()?;

This fixes the metric from reporting "0" for single-block batches, but verify the semantics:

  • If first_block_of_batch = 5 and last_added_block_number = 5, this now reports batch_size = 1
  • If first_block_of_batch = 5 and last_added_block_number = 7, this now reports batch_size = 3

Recommendation: Add a test case to prevent regression, especially since this metric is now used in benchmarking reports.


Major Issues

4. Load Test - Nonce Handling Still Fragile

Location: tooling/load_test/src/main.rs:232-233

The load test now uses BlockTag::Pending for nonce (good improvement!), but there's still a potential issue:

let nonce = client.get_nonce(account.address(), BlockIdentifier::Tag(BlockTag::Pending)).await.unwrap();

Issues:

  • Uses .unwrap() - if the RPC call fails, the entire test panics
  • No retry logic for transient network errors
  • Documentation warns about mempool limits but doesn't implement any safeguards

Recommendation: Add proper error handling with retry logic for the nonce fetch.


5. Bash Script - No Input Validation ⚠️

Location: scripts/bench_metrics.sh:44-56

The script uses basic string parsing with grep -o and cut, but doesn't validate numeric values:

batch=$(echo "$line" | grep -o 'batch=[0-9]*' | head -1 | cut -d= -f2)
secs=$(echo "$line" | grep -o 'proving_time_s=[0-9]*' | cut -d= -f2)

Issues:

  • If the log format changes slightly, parsing silently fails
  • No validation that batch, secs, ms are valid integers
  • The arithmetic operations later ($((total + ms))) will fail with non-numeric values

Recommendation: Add validation after parsing: [[ "$batch" =~ ^[0-9]+$ ]] before using values in arithmetic.
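
A minimal sketch of that validation, using a made-up log line that mirrors the field names implied by the grep patterns above (the exact log format and the `is_uint` helper are assumptions, not the script's actual code):

```shell
# Hypothetical log line matching the grep patterns shown above.
line='INFO prover: batch proved batch=42 proving_time_s=12 proving_time_ms=12345'

batch=$(echo "$line" | grep -o 'batch=[0-9]*' | head -1 | cut -d= -f2)
ms=$(echo "$line" | grep -o 'proving_time_ms=[0-9]*' | cut -d= -f2)

# POSIX-portable equivalent of bash's [[ "$1" =~ ^[0-9]+$ ]]
is_uint() { case "$1" in ''|*[!0-9]*) return 1 ;; *) return 0 ;; esac; }

if is_uint "$batch" && is_uint "$ms"; then
  echo "batch=$batch ms=$ms"
else
  # Malformed lines are skipped instead of feeding garbage into $(( ))
  echo "skipping malformed line" >&2
fi
```

With this guard, a log-format drift produces a skipped line rather than a bash arithmetic error mid-run.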


6. Bash Script - Integer Overflow Risk

Location: scripts/bench_metrics.sh:1054-1065

total=$((total + ms))
total_gas=$((total_gas + ${gas%%.*}))

Bash arithmetic is limited to signed 64-bit integers. For large benchmarks:

  • total_gas could overflow if many batches with high gas usage are processed
  • Unlikely in practice but possible with extended benchmarking sessions

Recommendation: Consider using bc for large arithmetic or document the limitation.


Minor Issues

7. Inconsistent Error Handling in ProverBackend Trait

Location: Multiple backend implementations

The new prover_type() method uses unimplemented!() for OpenVM and ZisK:

fn prover_type(&self) -> ProverType {
    unimplemented!("OpenVM is not yet enabled as a backend for the L2")
}

Issue: If someone accidentally enables OpenVM backend, this will panic at runtime instead of failing at compile time.

Recommendation: Consider returning Option<ProverType> or using a compile-time check with feature flags.


8. Documentation - Missing Cleanup Instructions

Location: docs/workflows/prover_benchmarking.md:650-664

The cleanup section checks for old databases but doesn't actually remove them. The user must manually interpret the output and decide what to delete.

Recommendation: Add explicit cleanup commands or a helper script for one-step cleanup.


9. Load Test - Max Fee Per Gas Set to i64::MAX

Location: tooling/load_test/src/main.rs:252

max_fee_per_gas: Some(i64::MAX as u64),

Issue: Casting i64::MAX (9,223,372,036,854,775,807) to u64 is technically lossless, but it sets an astronomically high gas price cap that could cause problems if the test ever runs against a real network (even a testnet).

Recommendation: Use a reasonable max gas price constant (e.g., 1000 * 10^9 for 1000 gwei) or make it configurable.


10. Trait Method Without Default Implementation

Location: crates/l2/prover/src/backend/mod.rs:160

The new prover_type() method is added to the trait without a default implementation, forcing all implementors to update. This is a breaking change for any external backend implementations.

Recommendation: Acceptable for internal use, but document this as a breaking API change if the trait is public.


Performance Considerations

11. Metrics Fetching in Benchmark Script

Location: scripts/bench_metrics.sh:949-956

The script makes a separate curl call for each metric for each batch:

gas=$(fetch_metric "batch_gas_used" "$batch" 2>/dev/null || true)
txs=$(fetch_metric "batch_tx_count" "$batch" 2>/dev/null || true)
blocks=$(fetch_metric "batch_size" "$batch" 2>/dev/null || true)

Issue: For N batches, this makes 3N HTTP requests. With 100 batches, that's 300 requests.

Recommendation: Fetch all metrics once and parse locally:

metrics=$(curl -s "$METRICS_URL" 2>/dev/null)
gas=$(echo "$metrics" | grep "^batch_gas_used{.*batch_number=\"${batch}\"" | ...)
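
A fuller sketch of the single-fetch approach. The Prometheus text is stubbed inline because no live endpoint is available here (in the real script it would come from `curl -s "$METRICS_URL"`); the metric and label names are taken from the snippets above, and `fetch_metric_local` is a hypothetical helper name:

```shell
# Stub for: metrics=$(curl -s "$METRICS_URL")
metrics='batch_gas_used{batch_number="7"} 1500000
batch_tx_count{batch_number="7"} 42
batch_size{batch_number="7"} 3'

# One fetch, then local lookups: O(1) HTTP requests instead of 3N.
fetch_metric_local() {  # $1 = metric name, $2 = batch number
  printf '%s\n' "$metrics" | grep "^$1{.*batch_number=\"$2\"" | awk '{print $2}'
}

gas=$(fetch_metric_local batch_gas_used 7)
txs=$(fetch_metric_local batch_tx_count 7)
blocks=$(fetch_metric_local batch_size 7)
echo "gas=$gas txs=$txs blocks=$blocks"
```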

Security Considerations

12. SSH Commands in Documentation

Location: docs/workflows/prover_benchmarking.md

The workflow documentation shows direct SSH command execution patterns:

ssh <server> "bash -l -c 'command'"

Issue: This is secure for the documented use case, but the commands include sensitive paths and private keys. If users copy-paste without understanding, they might expose credentials.

Recommendation: Add a security note about protecting private keys and not using production credentials.


Code Quality & Style

13. Inconsistent Logging Levels

The prover logs batch completion at info! level whether timed or not, which is good. However, the coordinator rejection is logged as error! when it should be warn! or info!.

14. Magic Numbers

tooling/load_test/src/main.rs:260: sleep(Duration::from_micros(800)) - This 800µs delay should be a named constant with explanation.

15. Duplicate Code

The prover's prove vs prove_timed branches (prover.rs:102-127) have duplicated logging logic. Consider extracting a helper function.


Positive Notes ✓

  1. Protocol Design: The prover_type field in BatchRequest is well-designed and enables proper coordination
  2. Backward Compatibility: The PR doesn't break existing functionality (besides unimplemented backends)
  3. Documentation: The benchmarking guide and workflow docs are thorough and helpful
  4. Testing Infrastructure: The endless mode and env var support make the load test much more flexible
  5. Metrics Integration: Enriching benchmark results with Prometheus data is excellent for analysis
  6. Path Fixes: The VK path fixes prevent build failures after the guest program move

Summary

Approval Status: ⚠️ Conditional Approval - Fix critical issues before merge

Critical Issues to Address:

  1. Fix error handling for ProverTypeNotNeeded (issue #2)
  2. Clarify/document the proof coordinator logic change (issue #1)
  3. Add nonce error handling in load test (issue #4)

The PR adds valuable infrastructure for performance measurement and the overall design is solid. The main concerns are around error handling and the coordinator logic change.


Automated review by Claude (Anthropic) · custom prompt

Copilot AI left a comment

Pull request overview

This pull request adds comprehensive tooling and documentation for benchmarking prover performance on the ethrex L2. The changes enable reproducible and ergonomic prover benchmarking workflows by adding structured timing logs, a results collection script, load test improvements, and detailed documentation for both manual and agent-assisted workflows.

Changes:

  • Adds prover instrumentation with prove_timed() method and --timed CLI flag for measuring proving time
  • Enhances proof coordinator to filter provers by type, preventing unnecessary work assignments when specific proof types aren't needed
  • Improves load test with pending nonce support, endless mode, and environment variable configuration for easier automation
  • Fixes two bugs: VK file paths after guest program relocation and off-by-one error in batch_size metric
  • Adds comprehensive documentation for prover benchmarking workflows

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

Show a summary per file
  • tooling/load_test/src/main.rs: Adds endless mode, uses pending nonces for consecutive runs, adds env var support, refactors into run_round function
  • scripts/bench_metrics.sh: New benchmark results collection script that parses prover logs and enriches with Prometheus batch metadata
  • docs/workflows/prover_benchmarking.md: New agent workflow documentation for remote server benchmarking
  • docs/l2/prover-benchmarking.md: New user guide for prover benchmarking on local/remote setups
  • crates/l2/prover/src/prover.rs: Adds timed proving support controlled by --timed flag
  • crates/l2/prover/src/config.rs: Adds timed field to ProverConfig
  • crates/l2/prover/src/backend/mod.rs: Adds prover_type() method to ProverBackend trait
  • crates/l2/prover/src/backend/*.rs: Implements prover_type() method for all backend implementations
  • crates/l2/sequencer/proof_coordinator.rs: Adds prover type filtering logic to skip unnecessary work assignments
  • crates/l2/common/src/prover.rs: Adds prover_type field to BatchRequest and new ProverTypeNotNeeded response
  • crates/l2/tee/quote-gen/src/sender.rs: Updates get_batch to include prover_type: ProverType::TDX in BatchRequest
  • crates/l2/sequencer/l1_committer.rs: Fixes off-by-one error in batch_size metric calculation
  • cmd/ethrex/l2/deployer.rs: Fixes VK file paths after guest program relocation to crates/guest-program/bin
  • cmd/ethrex/l2/options.rs: Adds --timed CLI option for prover client
  • crates/l2/Makefile: Simplifies deploy-l1-sp1 target to use --sp1 flag
Comments suppressed due to low confidence (1)

crates/l2/tee/quote-gen/src/sender.rs:52

  • The get_batch function does not handle the new ProofData::ProverTypeNotNeeded response variant that was added to the protocol. When a TDX prover connects but TDX proofs are not needed by the coordinator, it will send back ProverTypeNotNeeded, but this function will treat it as an unexpected response and return a generic error "Expecting ProofData::Response". Add a match arm to handle ProofData::ProverTypeNotNeeded and return a more descriptive error message.
pub async fn get_batch(commit_hash: String) -> Result<(u64, ProgramInput), String> {
    let batch = connect_to_prover_server_wr(&ProofData::BatchRequest {
        commit_hash: commit_hash.clone(),
        prover_type: ProverType::TDX,
    })
    .await
    .map_err(|e| format!("Failed to get Response: {e}"))?;
    match batch {
        ProofData::BatchResponse {
            batch_number,
            input,
            ..
        } => match (batch_number, input) {
            (Some(batch_number), Some(input)) => {
                #[cfg(feature = "l2")]
                let input = ProgramInput {
                    blocks: input.blocks,
                    execution_witness: input.execution_witness,
                    elasticity_multiplier: input.elasticity_multiplier,
                    blob_commitment: input.blob_commitment,
                    blob_proof: input.blob_proof,
                    fee_configs: input.fee_configs,
                };
                #[cfg(not(feature = "l2"))]
                let input = ProgramInput {
                    blocks: input.blocks,
                    execution_witness: input.execution_witness,
                };
                Ok((batch_number, input))
            }
            _ => Err("No blocks to prove.".to_owned()),
        },
        ProofData::NoBatchForVersion {
            commit_hash: server_code_version,
        } => Err(format!(
            "Next batch does not match with the current version. Server code: {}, Prover code: {}",
            server_code_version, commit_hash
        )),
        _ => Err("Expecting ProofData::Response".to_owned()),
    }
}


greptile-apps bot commented Feb 11, 2026

Greptile Overview

Greptile Summary

This PR adds reproducible prover benchmarking support across the L2 stack: new optional timed proving instrumentation (controlled by --timed / PROVER_CLIENT_TIMED), a metrics parsing script, load-test improvements (pending nonce, endless mode, env var configuration), and new benchmarking documentation.

On the coordinator side, BatchRequest now includes prover_type, allowing the proof coordinator to reject/skip assigning batches to provers whose backend isn’t required by the current deployment, reducing wasted polling and DB work. The deployer also updates hardcoded verification-key paths after the guest program relocation, and a batch_size metric off-by-one is fixed in the committer.

Confidence Score: 3/5

  • This PR is close to safe to merge but has a runtime panic in certain feature/backends that should be fixed first.
  • Most changes are additive (instrumentation, scripts, docs) and the coordinator/request wiring looks consistent, but prover_type() is left as unimplemented!() in the OpenVM and ZisK backends while those backends remain selectable, which will panic at runtime when enabled/used.
  • crates/l2/prover/src/backend/openvm.rs, crates/l2/prover/src/backend/zisk.rs

Important Files Changed

  • cmd/ethrex/l2/deployer.rs: Updates default VK file paths for RISC0/SP1 to new guest-program location; otherwise unchanged.
  • cmd/ethrex/l2/options.rs: Adds --timed / PROVER_CLIENT_TIMED flag to prover client options and threads it into ProverConfig.
  • crates/l2/Makefile: Adds/adjusts localnet/prover convenience targets and env exports for benchmarking workflow.
  • crates/l2/common/src/prover.rs: Extends BatchRequest with prover_type so coordinator can filter assignments; appears consistent with coordinator usage.
  • crates/l2/prover/src/backend/exec.rs: Implements prover_type() for exec backend and related instrumentation hooks.
  • crates/l2/prover/src/backend/mod.rs: Adds prover_type() to ProverBackend trait and propagates backend parsing; enables coordinator filtering by backend type.
  • crates/l2/prover/src/backend/openvm.rs: Adds ProverBackend::prover_type() but currently unimplemented!(), causing a runtime panic if the openvm backend is selected.
  • crates/l2/prover/src/backend/risc0.rs: Implements prover_type() for Risc0 backend; no issues found.
  • crates/l2/prover/src/backend/sp1.rs: Implements prover_type() for SP1 backend; no issues found.
  • crates/l2/prover/src/backend/zisk.rs: Adds ProverBackend::prover_type() but currently unimplemented!(), causing a runtime panic if the zisk backend is selected.
  • crates/l2/prover/src/config.rs: Adds timed field to ProverConfig for controlling proving-time instrumentation.
  • crates/l2/prover/src/prover.rs: Wires new prover_type into BatchRequest and optional timed proving; backend-specific prover_type() panics remain a risk.
  • crates/l2/sequencer/l1_committer.rs: Fixes batch_size metric off-by-one; no issues found.
  • crates/l2/sequencer/proof_coordinator.rs: Handles prover_type in BatchRequest and skips batches for unneeded prover types; logic appears consistent with new request format.
  • crates/l2/tee/quote-gen/src/sender.rs: Adds prover_type=TDX in batch requests to coordinator; aligns with coordinator filtering behavior.
  • docs/l2/prover-benchmarking.md: Adds benchmarking guide documentation.
  • docs/workflows/prover_benchmarking.md: Adds agent workflow for benchmarking documentation.
  • scripts/bench_metrics.sh: Adds bench log parsing script for prover timing metrics; not executed in build.
  • tooling/load_test/src/main.rs: Adds pending-nonce usage, endless mode, and env var support for load test CLI.

Sequence Diagram

sequenceDiagram
    participant Prover as Prover client
    participant Coord as ProofCoordinator
    participant Store as Store/DB

    Prover->>Coord: BatchRequest{ chain_id, last_verified_batch, prover_type }
    Coord->>Store: fetch needed_proof_types + next batch/input
    alt prover_type not needed
        Coord-->>Prover: ProverTypeNotNeeded
    else proof already exists
        Coord-->>Prover: ProofAlreadyExists
    else no batch/input available
        Coord-->>Prover: EmptyBatch / No work
    else batch assigned
        Coord-->>Prover: BatchResponse{ batch, input, ... }
        Prover->>Prover: prove() / prove_timed() (if --timed)
        Prover->>Coord: submit proof
    end

@greptile-apps greptile-apps bot left a comment

19 files reviewed, 2 comments


ArgAction::Set, not SetTrue), shut down the prover process when the
coordinator rejects its type instead of retrying forever, and cache the
metrics endpoint response in bench_metrics.sh to avoid 3N HTTP requests.
avilagaston9 added a commit that referenced this pull request Feb 11, 2026
…mplemented!),

make send_batches_proof_to_contract private, add natspec to based contract's
verifyBatches noting it has no access control, update all L2 docs replacing
verifyBatch references with verifyBatches, move distributed_proving.md from
docs/prover/ to docs/l2/fundamentals/ since it describes the interaction between
proof coordinator, proof sender, and provers rather than prover internals, and
restructure the doc to be explanation-first with the testing guide at the end.

Labels

L2 Rollup client
