
perf: Concurrent chip proving#1254

Merged
hero78119 merged 20 commits into master from feat/concurrent-chip-proving on Mar 2, 2026

Conversation


@Velaciela Velaciela commented Feb 11, 2026

Best merged after #1250, so we can observe the perf gain once the overlap gap in witness generation is reduced.

CENO_SCHEDULER_WORKERS_NUM sets the number of concurrent streams (default: 64).
RUST_MIN_STACK=16777216 is required for multi-threaded execution.

For dev & CI: set CENO_GPU_MEM_TRACKING=1 CENO_CONCURRENT_CHIP_PROVING=0 to verify GPU memory estimation.


Summary

This PR introduces a memory-aware parallel scheduler for chip proof generation on the GPU, enabling multiple chip proofs to run concurrently while respecting GPU VRAM limits. The key idea is a greedy backfilling algorithm that maximizes GPU utilization and eliminates long-tail latency from large chips blocking the pipeline.

Motivation

Previously, chip proofs were generated sequentially — one circuit at a time. This left significant GPU idle time, especially when small chip proofs could run in parallel with large ones. By scheduling chip proofs concurrently with preemptive memory reservation, we can significantly improve overall proving throughput.

Key Changes

1. New Scheduler Module (scheduler.rs)

  • Implements a greedy backfilling algorithm: tasks are sorted by memory requirement (descending, "big rocks first"), and the scheduler tries to fit the largest task first; if it doesn't fit, it backfills with smaller tasks.
  • Supports both sequential and concurrent execution modes, controlled by CENO_CONCURRENT_CHIP_PROVING env var (default: concurrent on GPU).
  • Uses std::thread::scope with a fixed worker pool (8 workers) and MPSC channels for task dispatch and completion notification.
  • Integrates with a CUDA memory pool (CudaMemPool) for preemptive VRAM booking/unbooking, preventing OOM during parallel execution.
  • Handles worker panics (catches unwind, converts to ZKVMError) to avoid scheduler deadlocks.

2. GPU Memory Estimation (gpu/memory.rs)

  • New module providing pre-execution GPU memory estimation for each chip proof stage:
    • estimate_trace_bytes — witness polynomial extraction + structural MLEs
    • estimate_main_witness_bytes — read/write/logup record polynomials
    • estimate_tower_bytes — tower build + prove stages
    • estimate_ecc_quark_bytes_from_num_vars — ECC quark sumcheck
    • estimate_main_constraints_bytes — GKR zerocheck + rotation sumcheck
  • estimate_chip_proof_memory computes the peak memory as: resident + max(stage temporaries) + safety margin
  • Dev-mode validation (CENO_GPU_MEM_TRACKING=1 CENO_CONCURRENT_CHIP_PROVING=0): check_gpu_mem_estimation compares estimated vs. actual GPU memory usage at each stage, with configurable tolerance (1MB under-estimate, 5MB over-estimate).

3. Standalone _impl Functions for Thread Safety

  • Extracted prove_tower_relation_impl, prove_main_constraints_impl, prove_ec_sum_quark_impl as standalone functions (no &self) from the original trait methods.
  • This eliminates Send/Sync requirements on GpuProver/ZKVMProver, enabling parallel execution without capturing &self.
  • The original trait methods now delegate to these standalone functions.
  • Added create_chip_proof_gpu_impl as a standalone GPU-specific entry point for the concurrent scheduler.

4. Deferred Witness Extraction (GPU)

  • Introduces a ChipInputPreparer trait with GPU/CPU implementations.
  • GPU path: witness MLEs and structural witnesses are extracted just-in-time per task (deferred extraction), rather than all-at-once upfront. This reduces peak VRAM usage.
  • CPU path: no-op (input is eagerly populated during task building, as before).
  • New standalone functions: extract_witness_mles_for_trace, transport_structural_witness_to_gpu.

5. Per-Thread CUDA Stream Support

  • GPU operations now use per-thread CUDA streams (get_thread_stream(), bind_thread_stream()), enabling concurrent kernel execution across worker threads.
  • All GPU allocation/kernel calls (tower build, sumcheck, filter, etc.) are updated to accept an optional stream parameter.

6. Prover Refactoring: 3-Phase Architecture

The prover's create_proof_of_shard is restructured into three clean phases:

  • Phase 1 (build_chip_tasks): Build ChipTask structs from witness data, with #[cfg] for eager (CPU) vs deferred (GPU) extraction.
  • Phase 2 (run_chip_proofs): Execute tasks via the scheduler. Concurrent mode uses standalone _impl functions; sequential mode uses self.create_chip_proof.
  • Phase 3 (collect_chip_results): Aggregate results into BTreeMap<circuit_idx, Vec<ZKVMChipProof>>, points, evaluations, and PI updates.

7. RcArc Migration

  • CpuBackend and GpuBackend references changed from Rc to Arc to support cross-thread sharing required by the concurrent scheduler.

8. Transcript Forking

  • Transcript forking is now handled internally by the scheduler: each worker clones the parent transcript, appends the task_id, and samples one extension-field element after proving. This avoids sending non-Send transcript objects across threads.

Benchmark

- APP Prove, baseline (master + witgen-perf #1250): **74.500s**
- APP Prove, concurrent chip proving: **60.800s**
  - (1.225x speedup @ 4090 24GB, block #600)

@Velaciela Velaciela force-pushed the feat/concurrent-chip-proving branch from 677f3f2 to 1d2f752 on February 11, 2026 02:08
@hero78119 hero78119 changed the base branch from feat/fork_script to master February 11, 2026 02:55
@Velaciela Velaciela marked this pull request as draft February 11, 2026 03:54
@Velaciela Velaciela force-pushed the feat/concurrent-chip-proving branch from 73fd202 to 97062da on February 12, 2026 01:44

@hero78119 hero78119 left a comment


LGTM! Just one minor comment related to the new scheduler.

@Velaciela Velaciela force-pushed the feat/concurrent-chip-proving branch from 6e3a723 to e62d84a on February 26, 2026 07:28
@Velaciela Velaciela marked this pull request as ready for review February 27, 2026 01:45

@hero78119 hero78119 left a comment


This brings a huge milestone for Ceno GPU utilization in the base layer!
Learned a lot during the review. Awesome work & LGTM 💯

let bh = BooleanHypercube::new(rotation_cyclic_group_log2);
let (left_point, right_point) = bh.get_rotation_points(&row_challenges_e);
// Capture parent thread's CUDA stream so Rayon workers can reuse it.
// TODO: GPU batch evaluation and its batch version


Noting the TODO: this makes prove_rotation sequential, since the Rayon workers share the same parent stream id. That is acceptable because prove_rotation's cost is low.

s.spawn(move || {
loop {
let task = {
let lock = rx.lock().unwrap();


A note: we prefer worker reuse over round-robin, but that depends on Rayon's thread scheduling. If we need stream ids to be maximally reused, we can come back and think about how to tune this part.

This only affects readability, not performance.

@hero78119 hero78119 added this pull request to the merge queue Mar 2, 2026
Merged via the queue into master with commit ac1922a Mar 2, 2026
5 checks passed
@hero78119 hero78119 deleted the feat/concurrent-chip-proving branch March 2, 2026 06:23