Merged
hero78119 (Collaborator) requested changes on Feb 26, 2026:

LGTM! Just one minor comment related to the new scheduler: hide `launch_idx` and simplify some of the conditioning logic.
hero78119 (Collaborator) approved these changes on Mar 1, 2026:

This brings a huge milestone for Ceno GPU utilization in the base layer! Learned a lot during the review. Awesome work & LGTM 💯
```rust
let bh = BooleanHypercube::new(rotation_cyclic_group_log2);
let (left_point, right_point) = bh.get_rotation_points(&row_challenges_e);
// Capture parent thread's CUDA stream so Rayon workers can reuse it.
// TODO: GPU batch evaluation and its batch version
```
Collaborator:

TODO/FIXME note: this makes `prove_rotation` sequential, since the rotation proofs share the same parent stream id. That is acceptable because the cost of proving rotation is low.
```rust
s.spawn(move || {
    loop {
        let task = {
            let lock = rx.lock().unwrap();
```
Collaborator:

A note: we prefer worker reuse over round-robin, but this depends on Rayon's thread scheduling. If we need stream ids to be maximally reused, we can come back and think about how to tune this part. This only affects readability, not performance.
Better merged after #1250, to observe the perf gain once the overlap gap in witness is reduced.
- `CENO_SCHEDULER_WORKERS_NUM` sets the number of concurrent streams, defaulting to 64.
- `RUST_MIN_STACK=16777216`: multi-thread execution needs this.

For dev & CI:

- `CENO_GPU_MEM_TRACKING=1 CENO_CONCURRENT_CHIP_PROVING=0` to verify GPU memory estimation.

Summary
This PR introduces a memory-aware parallel scheduler for chip proof generation on the GPU, enabling multiple chip proofs to run concurrently while respecting GPU VRAM limits. The key idea is a greedy backfilling algorithm that maximizes GPU utilization and eliminates long-tail latency from large chips blocking the pipeline.
Motivation
Previously, chip proofs were generated sequentially — one circuit at a time. This left significant GPU idle time, especially when small chip proofs could run in parallel with large ones. By scheduling chip proofs concurrently with preemptive memory reservation, we can significantly improve overall proving throughput.
Key Changes
1. New Scheduler Module (`scheduler.rs`)

- Controlled by the `CENO_CONCURRENT_CHIP_PROVING` env var (default: concurrent on GPU).
- Uses `std::thread::scope` with a fixed worker pool (8 workers) and MPSC channels for task dispatch and completion notification.
- A memory pool (`CudaMemPool`) for preemptive VRAM booking/unbooking, preventing OOM during parallel execution.
- Errors are propagated (as `ZKVMError`) to avoid scheduler deadlocks.

2. GPU Memory Estimation (`gpu/memory.rs`)

- `estimate_trace_bytes` — witness polynomial extraction + structural MLEs
- `estimate_main_witness_bytes` — read/write/logup record polynomials
- `estimate_tower_bytes` — tower build + prove stages
- `estimate_ecc_quark_bytes_from_num_vars` — ECC quark sumcheck
- `estimate_main_constraints_bytes` — GKR zerocheck + rotation sumcheck
- `estimate_chip_proof_memory` computes the peak memory as: `resident + max(stage temporaries) + safety margin`
- With `CENO_GPU_MEM_TRACKING=1 CENO_CONCURRENT_CHIP_PROVING=0`, `check_gpu_mem_estimation` compares estimated vs. actual GPU memory usage at each stage, with configurable tolerance (1MB under-estimate, 5MB over-estimate).

3. Standalone `_impl` Functions for Thread Safety

- `prove_tower_relation_impl`, `prove_main_constraints_impl`, `prove_ec_sum_quark_impl` extracted as standalone functions (no `&self`) from the original trait methods, avoiding `Send`/`Sync` requirements on `GpuProver`/`ZKVMProver` and enabling parallel execution without capturing `&self`.
- `create_chip_proof_gpu_impl` added as a standalone GPU-specific entry point for the concurrent scheduler.

4. Deferred Witness Extraction (GPU)

- New `ChipInputPreparer` trait with GPU/CPU implementations.
- `extract_witness_mles_for_trace`, `transport_structural_witness_to_gpu`.

5. Per-Thread CUDA Stream Support

- Per-thread CUDA streams (`get_thread_stream()`, `bind_thread_stream()`), enabling concurrent kernel execution across worker threads.

6. Prover Refactoring: 3-Phase Architecture

The prover's `create_proof_of_shard` is restructured into three clean phases:

- Phase 1 (`build_chip_tasks`): build `ChipTask` structs from witness data, with `#[cfg]` for eager (CPU) vs. deferred (GPU) extraction.
- Phase 2 (`run_chip_proofs`): execute tasks via the scheduler. Concurrent mode uses the standalone `_impl` functions; sequential mode uses `self.create_chip_proof`.
- Phase 3 (`collect_chip_results`): aggregate results into `BTreeMap<circuit_idx, Vec<ZKVMChipProof>>`, plus points, evaluations, and PI updates.

7. `Rc` → `Arc` Migration

- `CpuBackend` and `GpuBackend` references changed from `Rc` to `Arc` to support the cross-thread sharing required by the concurrent scheduler.

8. Transcript Forking

- Each task forks the transcript with its `task_id` and samples one extension-field element after proving. This avoids sending non-`Send` transcript objects across threads.

Benchmark