feat(l2): add prover benchmarking tooling and documentation #6157
base: main
Conversation
Switch the prover loop from prove() to prove_timed() so each batch logs a structured line with batch number and elapsed proving time (seconds and milliseconds). Add scripts/sp1_bench_metrics.sh that tails the prover log, collects results into a CSV, and prints a summary table on exit.
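As a rough illustration of what such a timed wrapper looks like (the real `prove_timed()` lives in the prover backend and its exact signature may differ; `Proof` and `prove` below are hypothetical stand-ins):

```rust
use std::time::Instant;

// Hypothetical stand-ins for the real backend call and proof type.
struct Proof;
fn prove(batch_number: u64) -> Proof {
    let _ = batch_number;
    Proof
}

// Sketch of a prove_timed-style wrapper: run the proving call and
// return the output together with the elapsed duration, so the caller
// can emit one structured log line per batch.
fn prove_timed(batch_number: u64) -> (Proof, std::time::Duration) {
    let start = Instant::now();
    let output = prove(batch_number);
    (output, start.elapsed())
}

fn main() {
    let (_proof, elapsed) = prove_timed(42);
    // Log both seconds and milliseconds, as the PR describes.
    println!(
        "proved batch=42 proving_time_s={:.3} proving_time_ms={}",
        elapsed.as_secs_f64(),
        u64::try_from(elapsed.as_millis()).unwrap_or(u64::MAX)
    );
}
```

The `bench_metrics.sh` script can then grep these structured fields out of the prover log without parsing free-form text.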
… of prover The guest program was moved from crates/l2/prover/src/ethrex_guest_program/ to crates/guest-program/, but the fallback VK paths in the deployer were not updated. This caused deploy-l1-sp1 to fail with "No such file or directory" when running outside Docker. CI was unaffected because it passes explicit VK paths via ETHREX_SP1_VERIFICATION_KEY_PATH in docker-compose.yaml.
…nce conflicts, and add --endless flag for continuous load generation. The load_test function now fetches the pending nonce (instead of latest) so re-runs pick up where the previous round left off. wait_until_all_included now tracks per-account target nonces instead of a flat tx_amount, which was incorrect for any run where the starting nonce was not zero.
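A minimal sketch of the per-account completion check this commit describes (`all_included` and the map shapes are simplified stand-ins, not the actual `wait_until_all_included` code):

```rust
use std::collections::HashMap;

// Each sender has its own target nonce (start nonce + txs sent); the
// round is done only when every account reaches its target. A flat
// global tx count breaks as soon as a starting nonce is nonzero.
fn all_included(current: &HashMap<&str, u64>, target: &HashMap<&str, u64>) -> bool {
    target
        .iter()
        .all(|(acct, t)| current.get(acct).copied().unwrap_or(0) >= *t)
}

fn main() {
    // Account "a" started at nonce 10 and sent 5 txs; "b" started at 0.
    let target = HashMap::from([("a", 15u64), ("b", 5u64)]);
    let mut current = HashMap::from([("a", 15u64), ("b", 4u64)]);
    assert!(!all_included(&current, &target)); // "b" still pending
    current.insert("b", 5);
    assert!(all_included(&current, &target));
    println!("ok");
}
```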
…ndpoint. The script now fetches batch_gas_used, batch_tx_count, and batch_size from the L2 metrics endpoint (localhost:3702/metrics) for each proved batch, joining them with proving time into a single CSV and summary table.
…ing. It now parses all proving_time lines from the file, fetches batch metadata from Prometheus, prints the table, and exits immediately.
…last] contains last - first + 1 blocks, not last - first. A single-block batch was reported as size 0.
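The fix is easy to sanity-check: for an inclusive range `[first, last]` the count is `last - first + 1`. A tiny illustration (the real computation lives in `l1_committer.rs`):

```rust
// Inclusive block range [first, last]: the number of blocks is
// last - first + 1, not last - first. A single-block batch must be 1.
fn batch_size(first: u64, last: u64) -> u64 {
    last - first + 1
}

fn main() {
    assert_eq!(batch_size(100, 100), 1); // single-block batch, not 0
    assert_eq!(batch_size(100, 104), 5);
    println!("ok");
}
```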
…ment. When enabled (default), the prover wraps each prove() call with timing and logs structured fields (proving_time_s, proving_time_ms). When disabled with --no-timed, the prover calls prove() directly without timing overhead. Also fixes a clippy as_conversions warning in the elapsed.as_millis() cast.
…ess mode. The clap args now accept LOAD_TEST_RPC_URL, LOAD_TEST_TX_AMOUNT, and LOAD_TEST_ENDLESS environment variables as alternatives to --node, --tx-amount, and --endless flags respectively. CLI flags take precedence over env vars.
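The precedence rule can be spelled out with std only (in the real tool, clap's `env = "..."` attribute on each arg implements this automatically; `resolve` below is a hypothetical helper, and only the env var names come from the PR text):

```rust
use std::env;

// An explicit CLI value wins over the environment variable, which wins
// over the built-in default.
fn resolve(cli_value: Option<String>, env_var: &str, default: &str) -> String {
    cli_value
        .or_else(|| env::var(env_var).ok())
        .unwrap_or_else(|| default.to_string())
}

fn main() {
    // CLI flag present: it takes precedence regardless of the env var.
    assert_eq!(
        resolve(Some("2000".into()), "LOAD_TEST_TX_AMOUNT", "64"),
        "2000"
    );
    // Neither CLI flag nor env var set: use the default.
    assert_eq!(resolve(None, "LOAD_TEST_UNSET_EXAMPLE", "64"), "64");
    println!("ok");
}
```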
The guide (docs/l2/prover-benchmarking.md) explains how to use the localnet, prover timing, load test, and benchmark script together to measure proving performance. The agent workflow (docs/workflows/prover_benchmarking.md) provides step-by-step instructions for running a remote benchmark session, following the same format as the existing execution witness benchmarking workflow.
🤖 Kimi Code Review

**Review Summary**

Overall, this PR introduces useful benchmarking infrastructure for the L2 prover with good observability improvements. The changes are mostly well-structured, but there are a few issues to address.

**Issues Found**

1. Potential Overflow in …
🤖 Claude Code Review

Based on my review of PR #6157, here's my comprehensive code review:

**Code Review: PR #6157 - Extract Prover Benchmarking Tooling**

**Summary**

This PR adds benchmarking infrastructure for proving performance, including timing instrumentation, load test improvements, and documentation. The changes are mostly low-risk tooling enhancements.

**Critical Issues**

1. Bash Script: Potential Integer Overflow (scripts/sp1_bench_metrics.sh:85-93)

```bash
count=0; total=0; min=999999999; max=0; total_gas=0; total_txs=0
```

Issue: Bash integer arithmetic can overflow for large millisecond values. If proving takes hours, …

Fix: Add overflow protection or use:

```bash
if [[ $total -gt 9223372036854775807 ]]; then
  echo "Warning: overflow detected in timing calculations"
fi
```

2. Bash Script: Division by Zero Risk (scripts/sp1_bench_metrics.sh:97)

```bash
avg=$((total / count))
```

Issue: While protected by …

Status: Actually safe, but consider making it more explicit.

**High Priority Issues**

3. Load Test: Unwrap Without Error Context (tooling/load_test/src/main.rs:232-234)

```rust
let nonce = client
    .get_nonce(account.address(), BlockIdentifier::Tag(BlockTag::Pending))
    .await
    .unwrap();
```

Issue: …

Fix:

```rust
.await
.context("Failed to fetch pending nonce")?;
```

4. Load Test: Missing Error Propagation Context (tooling/load_test/src/main.rs:285-288)

```rust
let nonce = client
    .get_nonce(src, BlockIdentifier::Tag(BlockTag::Latest))
    .await
    .unwrap();
```

Issue: Same as above in the wait loop.

**Medium Priority Issues**

5. Prover: Potential U64 Overflow in Milliseconds (crates/l2/prover/src/prover.rs:110)

```rust
proving_time_ms = u64::try_from(elapsed.as_millis()).unwrap_or(u64::MAX),
```

Issue: Using …

Recommendation: This is practically fine (proving won't take that long), but consider logging a warning on saturation for correctness:

```rust
proving_time_ms = u64::try_from(elapsed.as_millis())
    .inspect_err(|_| warn!("Proving time exceeds u64::MAX ms, clamping"))
    .unwrap_or(u64::MAX),
```

6. Bash Script: Inefficient Metric Fetching (scripts/sp1_bench_metrics.sh:30-37)

```bash
fetch_metric() {
  local metric="$1" batch="$2"
  curl -s "$METRICS_URL" 2>/dev/null \
    | grep "^${metric}{" \
    | grep "batch_number=\"${batch}\"" \
    | awk '{print $2}' \
    | head -1
}
```

Issue: This fetches the entire metrics endpoint for EACH metric for EACH batch (3 metrics per batch). For 100 batches, this makes 300 HTTP requests.

Fix: Fetch metrics once and cache:

```bash
METRICS_CACHE=$(curl -s "$METRICS_URL" 2>/dev/null)
fetch_metric() {
  echo "$METRICS_CACHE" | grep "^${1}{" | grep "batch_number=\"${2}\"" | awk '{print $2}' | head -1
}
```

7. Load Test: Nonce Race Condition (tooling/load_test/src/main.rs:232)

```rust
let nonce = client
    .get_nonce(account.address(), BlockIdentifier::Tag(BlockTag::Pending))
    .await
    .unwrap();
```

Issue: Using …

Status: Acceptable for single-instance testing. Document this limitation in the guide or add a warning.

**Low Priority / Style Issues**

8. Load Test: Clone in Tight Loop (tooling/load_test/src/main.rs:240-260)

```rust
for i in 0..tx_amount {
    let (value, calldata, dst) = tx_builder.build_tx();
    // ...
    sleep(Duration::from_micros(800)).await;
}
```

Issue: …

Status: Likely negligible impact. The sleep dominates timing.

9. Prover: Duplicate Log Messages (crates/l2/prover/src/prover.rs:111-124)

```rust
"Proved batch {} in {:.2?}",
prover_data.batch_number,
elapsed
// vs
"Proved batch {}", prover_data.batch_number
```

Suggestion: The two branches have similar messages. Consider extracting common logging:

```rust
let batch_proof = if self.timed {
    let (output, elapsed) = self.backend.prove_timed(...)?;
    info!(batch = ..., proving_time_s = ..., proving_time_ms = ..., "Proved batch {} in {:.2?}", ...);
    output
} else {
    let output = self.backend.prove(...)?;
    info!(batch = ..., "Proved batch {}", ...);
    output
};
self.backend.to_batch_proof(output, prover_data.format)
```

10. Documentation: Missing Batch Size Explanation (docs/l2/prover-benchmarking.md:127)

| `batch_size{batch_number="N"}` | Number of blocks in batch N |

Clarification: Given the fix in l1_committer.rs:976, confirm this metric now correctly represents block count (inclusive).

**Positive Observations**

✅ Off-by-one fix is correct (crates/l2/sequencer/l1_committer.rs:976)
✅ VK path updates are correct (cmd/ethrex/l2/deployer.rs:1018-1028)
✅ Timed flag design is clean
✅ Load test improvements are valuable
✅ Documentation is thorough
✅ ProverBackend trait usage is correct

**Security Considerations**

No critical security issues found. This is primarily tooling/benchmarking code that: …

**Performance Considerations**

**Recommendations**

**Verdict**

Approved with minor suggestions. The code is well-structured, the documentation is excellent, and the changes are low-risk. The main concerns are error handling in the load test (easy fix) and bash script efficiency (minor optimization).

Automated review by Claude (Anthropic) · custom prompt
🤖 Codex Code Review

**Findings (ordered by severity)**

**Other notes**

If you want, I can propose concrete patches for the two items above.

Automated review by OpenAI Codex · custom prompt
or in a loop depending on --endless, instead of a loop-with-break pattern.
34dcf6b to a786cd1
…lient, tx_builder, and accounts instead of owned values, removing unnecessary clones at the call sites in run_round.
This removes the need for --no-timed; users pass --timed when they want proving time measurement, and omit it otherwise.
both the default and gpu-enabled cargo run commands.
--timed can be passed via make (e.g. make init-prover-sp1 PROVER_ARGS=--timed). Update docs to use the Makefile commands with GPU=true and PROVER_ARGS examples.
PROVER_ARGS, so timing is enabled with TIMED=true (e.g. make init-prover-sp1 GPU=true TIMED=true).
…alias. Clap already reads the env var, so no Makefile plumbing is needed. Reverts the TIMED/PROVER_ARGS additions to the Makefile targets.
…ummary instead of CSV. Add load-test Makefile target with env var configuration. Update docs to use Makefile commands throughout, add mempool limit warning, note that L2 must be running when collecting results, and reference PROVER_CLIENT_TIMED env var instead of --timed flag.
root Makefile target instead. Update docs to run make load-test from the repo root.
script. The report now includes a Server Specs section with hardware info detected from /proc/cpuinfo, /proc/meminfo, sysctl, and nvidia-smi.
user edits to prover benchmarking guide.
by default) in the prover benchmarking workflow.
batch assignments by whether the prover's type is needed. Previously the coordinator assigned a batch whenever ANY required proof was missing, causing provers to waste time re-proving batches they had already proved (e.g. an SP1 prover proving a batch that only needed an exec proof). The ProverBackend trait now has a prover_type() method, implemented for each backend (todo!() for ZisK and OpenVM which lack ProverType variants). The prover client stores its type at startup and sends it with every BatchRequest. The coordinator checks needed_proof_types.contains() before assigning. Also fixes the deploy-l1-sp1 Makefile target which used the removed --sp1.deploy-verifier flag — replaced with --sp1 and removed dummy verifier addresses for other backends. Updated benchmarking docs to reference deploy-l1-sp1 for SP1 benchmarks.
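A minimal sketch of the assignment check this commit describes (simplified stand-in types, not the actual coordinator code; the real check uses `needed_proof_types.contains()` as stated above):

```rust
// Simplified stand-ins for the real ProverType and response enums.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ProverType { Exec, SP1, Risc0, TDX }

#[derive(Debug, PartialEq)]
enum Response { ProverTypeNotNeeded, Assigned(u64) }

// A batch is assigned only when the requesting prover's type is one of
// the proof types this deployment actually needs.
fn handle_batch_request(
    prover_type: ProverType,
    needed_proof_types: &[ProverType],
    next_batch: u64,
) -> Response {
    if !needed_proof_types.contains(&prover_type) {
        // Reject immediately, before touching the database.
        return Response::ProverTypeNotNeeded;
    }
    Response::Assigned(next_batch)
}

fn main() {
    let needed = [ProverType::Exec, ProverType::SP1];
    assert_eq!(
        handle_batch_request(ProverType::SP1, &needed, 7),
        Response::Assigned(7)
    );
    // A TDX prover gets no work when TDX proofs are not required.
    assert_eq!(
        handle_batch_request(ProverType::TDX, &needed, 7),
        Response::ProverTypeNotNeeded
    );
    println!("ok");
}
```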
already exposes it via the prover_type() trait method.
… todo!() for ZisK and OpenVM backends (no ProverType variant exists for them yet) and call self.backend.prover_type() directly instead of storing it as a field.
…oad_test formatting. Commit 517e535 added prover_type to BatchRequest but didn't update the TDX quote-gen sender (outside the workspace), breaking the tdx lint/build CI jobs. The load_test formatting was also off since it's outside the main workspace.
…arted immediately after the L2 is up, not just before the prover. The prover proves batches faster than the sequencer fills them with transactions, so any delay in starting the load test results in empty batches being proved.
empty batches. The block producer and committer build and commit empty blocks/batches faster than the load test can start sending transactions (especially for erc20/fibonacci/io-heavy which have a contract deployment setup phase). By pre-compiling the binary in step 1b and running it directly in step 3, the load test starts immediately when the L2 is up.
…t recompilation. Previously each step (deploy, L2 start, prover start) triggered its own `cargo run` which recompiled the binary. Now step 1b builds everything once with all needed features, and all subsequent steps use `target/release/ethrex` directly. Also made the docs backend-agnostic (not SP1-specific) with placeholder flags and tables for SP1/RISC0/Exec, and made the branch/commit prompt mandatory (no default).
ProofCoordinator: the field is used again in PR #6157 and is preserved to simplify merge conflicts.
…name sp1_bench_metrics.sh to bench_metrics.sh since the script is backend-agnostic, fix the load test targets table in the workflow doc to show -t flag values instead of Makefile targets, and add a note clarifying that the hardcoded keys in the deploy command are default test keys from fixtures/.
fix the load test progress percentage to be relative to the current round (not the absolute nonce), make run_round return Result so endless mode survives individual round failures, remove a duplicate debug log in the proof coordinator, add bash 4+ note to bench_metrics.sh, and clarify that LOAD_TEST_RPC_URL must point to the L2 port for benchmarking.
backend that is not in the deployment's required proof types, instead of silently sending an empty batch response. The prover now logs an error and surfaces the rejection clearly, rather than endlessly polling for batches it will never be assigned.
…ndant: we already check whether this specific prover type's proof exists for the batch, and we early-return if the prover type is not needed at all. If the current prover's proof exists, the batch is skipped regardless of what other proof types are missing.
if/else block. Each condition (proof already exists, batch not found, no input for version) is checked and returned independently.
get_prover_input_by_batch_and_version already returns None when the batch does not exist. Also send an empty batch response instead of NoBatchForVersion when no input is found, since the prover just needs to know there is nothing to prove right now.
that unneeded prover types are rejected immediately without hitting the database.
🤖 Kimi Code Review

**Review Summary**

This PR introduces prover benchmarking capabilities and fixes several issues in the L2 prover system. Here are the key findings:

**✅ Good Changes**
🤖 Codex Code Review

**Findings (ordered by severity)**

**Notes / minor observations**

If you want, I can dig into …

Automated review by OpenAI Codex · custom prompt
🤖 Claude Code Review

Now I have enough context to provide a comprehensive review. Let me compile my findings:

**PR Review: feat(l2): add prover benchmarking tooling and documentation**

**Overview**

This PR adds comprehensive tooling for reproducible prover benchmarking, including timing instrumentation, metrics collection, load testing improvements, and documentation. The changes are well-structured and the feature additions are valuable for performance measurement.

**Critical Issues**

1. Proof Coordinator Logic Change - Potential Race Condition Risk
Pull request overview
This pull request adds comprehensive tooling and documentation for benchmarking prover performance on the ethrex L2. The changes enable reproducible and ergonomic prover benchmarking workflows by adding structured timing logs, a results collection script, load test improvements, and detailed documentation for both manual and agent-assisted workflows.
Changes:
- Adds prover instrumentation with `prove_timed()` method and `--timed` CLI flag for measuring proving time
- Enhances proof coordinator to filter provers by type, preventing unnecessary work assignments when specific proof types aren't needed
- Improves load test with pending nonce support, endless mode, and environment variable configuration for easier automation
- Fixes two bugs: VK file paths after guest program relocation and off-by-one error in batch_size metric
- Adds comprehensive documentation for prover benchmarking workflows
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tooling/load_test/src/main.rs | Adds endless mode, uses pending nonces for consecutive runs, adds env var support, refactors into run_round function |
| scripts/bench_metrics.sh | New benchmark results collection script that parses prover logs and enriches with Prometheus batch metadata |
| docs/workflows/prover_benchmarking.md | New agent workflow documentation for remote server benchmarking |
| docs/l2/prover-benchmarking.md | New user guide for prover benchmarking on local/remote setups |
| crates/l2/prover/src/prover.rs | Adds timed proving support controlled by --timed flag |
| crates/l2/prover/src/config.rs | Adds timed field to ProverConfig |
| crates/l2/prover/src/backend/mod.rs | Adds prover_type() method to ProverBackend trait |
| crates/l2/prover/src/backend/*.rs | Implements prover_type() method for all backend implementations |
| crates/l2/sequencer/proof_coordinator.rs | Adds prover type filtering logic to skip unnecessary work assignments |
| crates/l2/common/src/prover.rs | Adds prover_type field to BatchRequest and new ProverTypeNotNeeded response |
| crates/l2/tee/quote-gen/src/sender.rs | Updates get_batch to include prover_type: ProverType::TDX in BatchRequest |
| crates/l2/sequencer/l1_committer.rs | Fixes off-by-one error in batch_size metric calculation |
| cmd/ethrex/l2/deployer.rs | Fixes VK file paths after guest program relocation to crates/guest-program/bin |
| cmd/ethrex/l2/options.rs | Adds --timed CLI option for prover client |
| crates/l2/Makefile | Simplifies deploy-l1-sp1 target to use --sp1 flag |
Comments suppressed due to low confidence (1)
crates/l2/tee/quote-gen/src/sender.rs:52
- The `get_batch` function does not handle the new `ProofData::ProverTypeNotNeeded` response variant that was added to the protocol. When a TDX prover connects but TDX proofs are not needed by the coordinator, it will send back `ProverTypeNotNeeded`, but this function will treat it as an unexpected response and return a generic error "Expecting ProofData::Response". Add a match arm to handle `ProofData::ProverTypeNotNeeded` and return a more descriptive error message.
```rust
pub async fn get_batch(commit_hash: String) -> Result<(u64, ProgramInput), String> {
    let batch = connect_to_prover_server_wr(&ProofData::BatchRequest {
        commit_hash: commit_hash.clone(),
        prover_type: ProverType::TDX,
    })
    .await
    .map_err(|e| format!("Failed to get Response: {e}"))?;
    match batch {
        ProofData::BatchResponse {
            batch_number,
            input,
            ..
        } => match (batch_number, input) {
            (Some(batch_number), Some(input)) => {
                #[cfg(feature = "l2")]
                let input = ProgramInput {
                    blocks: input.blocks,
                    execution_witness: input.execution_witness,
                    elasticity_multiplier: input.elasticity_multiplier,
                    blob_commitment: input.blob_commitment,
                    blob_proof: input.blob_proof,
                    fee_configs: input.fee_configs,
                };
                #[cfg(not(feature = "l2"))]
                let input = ProgramInput {
                    blocks: input.blocks,
                    execution_witness: input.execution_witness,
                };
                Ok((batch_number, input))
            }
            _ => Err("No blocks to prove.".to_owned()),
        },
        ProofData::NoBatchForVersion {
            commit_hash: server_code_version,
        } => Err(format!(
            "Next batch does not match with the current version. Server code: {}, Prover code: {}",
            server_code_version, commit_hash
        )),
        _ => Err("Expecting ProofData::Response".to_owned()),
    }
}
```
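A hedged sketch of the missing arm Copilot suggests, using a simplified stand-in for the `ProofData` enum (the real one carries more variants and fields): matching `ProverTypeNotNeeded` explicitly turns the generic "Expecting ProofData::Response" error into a descriptive one.

```rust
// Simplified stand-in for the protocol enum; the real ProofData has
// more variants (NoBatchForVersion, full BatchResponse payload, etc.).
#[derive(Debug)]
enum ProofData {
    BatchResponse { batch_number: Option<u64> },
    ProverTypeNotNeeded,
}

fn handle(batch: ProofData) -> Result<u64, String> {
    match batch {
        ProofData::BatchResponse { batch_number: Some(n) } => Ok(n),
        ProofData::BatchResponse { batch_number: None } => {
            Err("No blocks to prove.".to_owned())
        }
        // The arm the review asks for: a descriptive error instead of
        // falling through to the catch-all.
        ProofData::ProverTypeNotNeeded => {
            Err("Coordinator does not need TDX proofs for this deployment".to_owned())
        }
    }
}

fn main() {
    assert!(handle(ProofData::ProverTypeNotNeeded).unwrap_err().contains("TDX"));
    assert_eq!(handle(ProofData::BatchResponse { batch_number: Some(3) }), Ok(3));
    println!("ok");
}
```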
Greptile Overview

**Greptile Summary**

This PR adds reproducible prover benchmarking support across the L2 stack: new optional timed proving instrumentation (controlled by …). On the coordinator side, …

Confidence Score: 3/5
| Filename | Overview |
|---|---|
| cmd/ethrex/l2/deployer.rs | Updates default VK file paths for RISC0/SP1 to new guest-program location; otherwise unchanged. |
| cmd/ethrex/l2/options.rs | Adds --timed / PROVER_CLIENT_TIMED flag to prover client options and threads it into ProverConfig. |
| crates/l2/Makefile | Adds/adjusts localnet/prover convenience targets and env exports for benchmarking workflow. |
| crates/l2/common/src/prover.rs | Extends BatchRequest with prover_type so coordinator can filter assignments; appears consistent with coordinator usage. |
| crates/l2/prover/src/backend/exec.rs | Implements prover_type() for exec backend and related instrumentation hooks. |
| crates/l2/prover/src/backend/mod.rs | Adds prover_type() to ProverBackend trait and propagates backend parsing; enables coordinator filtering by backend type. |
| crates/l2/prover/src/backend/openvm.rs | Adds ProverBackend::prover_type() but currently unimplemented!() causing runtime panic if openvm backend is selected. |
| crates/l2/prover/src/backend/risc0.rs | Implements prover_type() for Risc0 backend; no issues found. |
| crates/l2/prover/src/backend/sp1.rs | Implements prover_type() for SP1 backend; no issues found. |
| crates/l2/prover/src/backend/zisk.rs | Adds ProverBackend::prover_type() but currently unimplemented!() causing runtime panic if zisk backend is selected. |
| crates/l2/prover/src/config.rs | Adds timed field to ProverConfig for controlling proving-time instrumentation. |
| crates/l2/prover/src/prover.rs | Wires new prover_type into BatchRequest and optional timed proving; backend-specific prover_type() panics remain a risk. |
| crates/l2/sequencer/l1_committer.rs | Fixes batch_size metric off-by-one; no issues found. |
| crates/l2/sequencer/proof_coordinator.rs | Handles prover_type in BatchRequest and skips batches for unneeded prover types; logic appears consistent with new request format. |
| crates/l2/tee/quote-gen/src/sender.rs | Adds prover_type=TDX in batch requests to coordinator; aligns with coordinator filtering behavior. |
| docs/l2/prover-benchmarking.md | Adds benchmarking guide documentation. |
| docs/workflows/prover_benchmarking.md | Adds agent workflow for benchmarking documentation. |
| scripts/bench_metrics.sh | Adds bench log parsing script for prover timing metrics; not executed in build. |
| tooling/load_test/src/main.rs | Adds pending-nonce usage, endless mode, and env var support for load test CLI. |
Sequence Diagram

```mermaid
sequenceDiagram
    participant Prover as Prover client
    participant Coord as ProofCoordinator
    participant Store as Store/DB
    Prover->>Coord: BatchRequest{ chain_id, last_verified_batch, prover_type }
    Coord->>Store: fetch needed_proof_types + next batch/input
    alt prover_type not needed
        Coord-->>Prover: ProverTypeNotNeeded
    else proof already exists
        Coord-->>Prover: ProofAlreadyExists
    else no batch/input available
        Coord-->>Prover: EmptyBatch / No work
    else batch assigned
        Coord-->>Prover: BatchResponse{ batch, input, ... }
        Prover->>Prover: prove() / prove_timed() (if --timed)
        Prover->>Coord: submit proof
    end
```
19 files reviewed, 2 comments
ArgAction::Set, not SetTrue), shut down the prover process when the coordinator rejects its type instead of retrying forever, and cache the metrics endpoint response in bench_metrics.sh to avoid 3N HTTP requests.
…mplemented!), make send_batches_proof_to_contract private, add natspec to based contract's verifyBatches noting it has no access control, update all L2 docs replacing verifyBatch references with verifyBatches, move distributed_proving.md from docs/prover/ to docs/l2/fundamentals/ since it describes the interaction between proof coordinator, proof sender, and provers rather than prover internals, and restructure the doc to be explanation-first with the testing guide at the end.
Motivation
Add tooling to make prover benchmarking workflows reproducible and ergonomic: structured timing logs, a results collection script, load test improvements, and documentation.
Description
Bug fixes:
- `batch_size` metric (single-block batch reported as 0)

Load test improvements:
- `--endless` flag for continuous load generation
- Env var configuration: `LOAD_TEST_RPC_URL`, `LOAD_TEST_TX_AMOUNT`, `LOAD_TEST_ENDLESS`

Prover instrumentation:
- Timed proving (`prove_timed`) and log-parsing benchmark script (`scripts/bench_metrics.sh`)
- `--timed` flag to prover CLI to control whether proving time is measured (default: disabled). Set via `PROVER_CLIENT_TIMED` env var.

Proof coordinator:
- `prover_type` field to `BatchRequest` so the coordinator knows which backend the client runs
- `prover_type()` method to the `ProverBackend` trait

Documentation:
- Benchmarking guide (`docs/l2/prover-benchmarking.md`)
- Agent workflow (`docs/workflows/prover_benchmarking.md`)

Checklist
- `STORE_SCHEMA_VERSION` (`crates/storage/lib.rs`) if the PR includes breaking changes to the `Store` requiring a re-sync.