
perf: Concurrent chip proving#1254

Merged
hero78119 merged 20 commits into master from feat/concurrent-chip-proving on Mar 2, 2026

Conversation


@Velaciela Velaciela commented Feb 11, 2026

Best merged after #1250, so we can observe the perf gain once the overlap gap in witness generation is reduced.

CENO_SCHEDULER_WORKERS_NUM sets the number of concurrent streams (default: 64).
RUST_MIN_STACK=16777216 is required for multi-threaded execution.

For dev & CI: set CENO_GPU_MEM_TRACKING=1 CENO_CONCURRENT_CHIP_PROVING=0 to verify GPU memory estimation.


Summary

This PR introduces a memory-aware parallel scheduler for chip proof generation on the GPU, enabling multiple chip proofs to run concurrently while respecting GPU VRAM limits. The key idea is a greedy backfilling algorithm that maximizes GPU utilization and eliminates long-tail latency from large chips blocking the pipeline.

Motivation

Previously, chip proofs were generated sequentially — one circuit at a time. This left significant GPU idle time, especially when small chip proofs could run in parallel with large ones. By scheduling chip proofs concurrently with preemptive memory reservation, we can significantly improve overall proving throughput.

Key Changes

1. New Scheduler Module (scheduler.rs)

  • Implements a greedy backfilling algorithm: tasks are sorted by memory requirement (descending, "big rocks first"), and the scheduler tries to fit the largest task first; if it doesn't fit, it backfills with smaller tasks.
  • Supports both sequential and concurrent execution modes, controlled by CENO_CONCURRENT_CHIP_PROVING env var (default: concurrent on GPU).
  • Uses std::thread::scope with a fixed worker pool (8 workers) and MPSC channels for task dispatch and completion notification.
  • Integrates with a CUDA memory pool (CudaMemPool) for preemptive VRAM booking/unbooking, preventing OOM during parallel execution.
  • Handles worker panics (catches unwind, converts to ZKVMError) to avoid scheduler deadlocks.

2. GPU Memory Estimation (gpu/memory.rs)

  • New module providing pre-execution GPU memory estimation for each chip proof stage:
    • estimate_trace_bytes — witness polynomial extraction + structural MLEs
    • estimate_main_witness_bytes — read/write/logup record polynomials
    • estimate_tower_bytes — tower build + prove stages
    • estimate_ecc_quark_bytes_from_num_vars — ECC quark sumcheck
    • estimate_main_constraints_bytes — GKR zerocheck + rotation sumcheck
  • estimate_chip_proof_memory computes the peak memory as: resident + max(stage temporaries) + safety margin
  • Dev-mode validation (CENO_GPU_MEM_TRACKING=1 CENO_CONCURRENT_CHIP_PROVING=0): check_gpu_mem_estimation compares estimated vs. actual GPU memory usage at each stage, with configurable tolerance (1MB under-estimate, 5MB over-estimate).

3. Standalone _impl Functions for Thread Safety

  • Extracted prove_tower_relation_impl, prove_main_constraints_impl, prove_ec_sum_quark_impl as standalone functions (no &self) from the original trait methods.
  • This eliminates Send/Sync requirements on GpuProver/ZKVMProver, enabling parallel execution without capturing &self.
  • The original trait methods now delegate to these standalone functions.
  • Added create_chip_proof_gpu_impl as a standalone GPU-specific entry point for the concurrent scheduler.

4. Deferred Witness Extraction (GPU)

  • Introduces a ChipInputPreparer trait with GPU/CPU implementations.
  • GPU path: witness MLEs and structural witnesses are extracted just-in-time per task (deferred extraction), rather than all-at-once upfront. This reduces peak VRAM usage.
  • CPU path: no-op (input is eagerly populated during task building, as before).
  • New standalone functions: extract_witness_mles_for_trace, transport_structural_witness_to_gpu.

5. Per-Thread CUDA Stream Support

  • GPU operations now use per-thread CUDA streams (get_thread_stream(), bind_thread_stream()), enabling concurrent kernel execution across worker threads.
  • All GPU allocation/kernel calls (tower build, sumcheck, filter, etc.) are updated to accept an optional stream parameter.

6. Prover Refactoring: 3-Phase Architecture

The prover's create_proof_of_shard is restructured into three clean phases:

  • Phase 1 (build_chip_tasks): Build ChipTask structs from witness data, with #[cfg] for eager (CPU) vs deferred (GPU) extraction.
  • Phase 2 (run_chip_proofs): Execute tasks via the scheduler. Concurrent mode uses standalone _impl functions; sequential mode uses self.create_chip_proof.
  • Phase 3 (collect_chip_results): Aggregate results into BTreeMap<circuit_idx, Vec<ZKVMChipProof>>, points, evaluations, and PI updates.

7. RcArc Migration

  • CpuBackend and GpuBackend references changed from Rc to Arc to support cross-thread sharing required by the concurrent scheduler.

8. Transcript Forking

  • Transcript forking is now handled internally by the scheduler: each worker clones the parent transcript, appends the task_id, and samples one extension-field element after proving. This avoids sending non-Send transcript objects across threads.

Benchmark

- APP Prove, baseline (master + witgen-perf #1250): **74.500s**
- APP Prove, concurrent chip proving: **60.800s**
  - (1.225x speedup @ 4090 24GB, block #600)

@Velaciela Velaciela force-pushed the feat/concurrent-chip-proving branch from 677f3f2 to 1d2f752 on February 11, 2026 02:08
@hero78119 hero78119 changed the base branch from feat/fork_script to master February 11, 2026 02:55
@Velaciela Velaciela marked this pull request as draft February 11, 2026 03:54
@Velaciela Velaciela force-pushed the feat/concurrent-chip-proving branch from 73fd202 to 97062da on February 12, 2026 01:44

@hero78119 hero78119 left a comment


LGTM! Just one minor comment related to the new scheduler.

@Velaciela Velaciela force-pushed the feat/concurrent-chip-proving branch from 6e3a723 to e62d84a on February 26, 2026 07:28
@Velaciela Velaciela marked this pull request as ready for review February 27, 2026 01:45

@hero78119 hero78119 left a comment


This brings a huge milestone for Ceno GPU utilization in the base layer!
Learned a lot during the review. Awesome work & LGTM 💯

let bh = BooleanHypercube::new(rotation_cyclic_group_log2);
let (left_point, right_point) = bh.get_rotation_points(&row_challenges_e);
// Capture parent thread's CUDA stream so Rayon workers can reuse it.
// TODO: GPU batch evaluation and its batch version


Noting the TODO: this makes prove_rotation sequential, since the Rayon workers share the same parent stream id. That is acceptable because prove_rotation's cost is low.

s.spawn(move || {
loop {
let task = {
let lock = rx.lock().unwrap();


A note: we prefer worker reuse over round-robin, but that depends on Rayon's thread scheduling. If we need stream ids to be maximally reused, we can come back and think about how to tune this part.

This only affects readability, not performance.

@hero78119 hero78119 added this pull request to the merge queue Mar 2, 2026
Merged via the queue into master with commit ac1922a Mar 2, 2026
5 checks passed
@hero78119 hero78119 deleted the feat/concurrent-chip-proving branch March 2, 2026 06:23