
10x cheaper C2/PRU: CuZK Proving engine #1043

Open
magik6k wants to merge 60 commits into main from feat/cuzk

Conversation

@magik6k
Collaborator

@magik6k magik6k commented Feb 20, 2026

Summary

Integrate the cuzk persistent GPU SNARK proving daemon with Curio's task scheduler via gRPC. When enabled, Curio delegates PoRep C2, SnapDeals prove, and PSProve SNARK computations to the cuzk daemon instead of spawning per-proof child processes through ffiselect.

  • Add gRPC client (lib/cuzk/) and SealCalls methods (lib/ffi/cuzk_funcs.go) for PoRep and SnapDeals
  • Wire cuzk into PoRep, SnapDeals, and PSProve tasks with backpressure via GetStatus
  • Add make cuzk build target (NOT in default BINS — CI unaffected)
  • Vendor bellpepper-core and supraseal-c2 crate files so git clone && make cuzk works
  • Add user-facing documentation under documentation/en/experimental-features/

What is cuzk

cuzk is a persistent Rust daemon that keeps Groth16 SRS parameters (~47 GiB for 32 GiB PoRep) resident in CUDA-pinned host memory across proofs. The current ffiselect model spawns a fresh process per proof, loading the SRS from scratch each time (30-90s). cuzk eliminates this overhead entirely.
Beyond SRS residency, cuzk implements a 13-phase optimization pipeline that achieves 2.8x throughput over the ffiselect baseline (37.7s/proof vs ~89s on RTX 5070 Ti). The key architectural contributions are pipelined partition synthesis, dual-worker GPU interlock, PCIe transfer optimization, and a split async GPU proving API.

Architecture

Curio (Go)                          cuzk daemon (Rust/CUDA)
─────────────                       ──────────────────────
tasks/seal,snap,proofshare          persistent process
        │                                   │
    gRPC client ────── unix/TCP ──────► gRPC server
  (lib/cuzk/client.go)                     │
                                    ┌──────┴───────┐
                                    │   Scheduler   │
                                    │  (priority Q) │
                                    └──────┬───────┘
                                           │
                                    ┌──────┴───────┐
                                    │  GPU Workers  │
                                    │  (per device) │
                                    └──────────────┘

Vanilla proofs are generated locally in Curio (requires sector data on disk), then sent to cuzk for SNARK computation. The returned proof is verified locally before submission.

Pipelining

The cuzk engine pipelines work at three levels:

1. Partition-Level Synthesis → GPU Pipeline (Phase 7)

Instead of synthesizing all 10 partitions of a sector as a batch before any GPU work begins (the ffiselect model), cuzk spawns partition_workers concurrent synthesis tasks. Each produces a single partition's ProvingAssignment (~13.6 GiB) and sends it through a bounded channel to the GPU worker:

Synthesis Workers:  Part 0    Part 1    Part 2   ...
                       │         │         │
                       ▼         ▼         ▼
GPU Channel:        ─── P0 ───── P1 ───── P2 ───►
                                               │
GPU Worker:         Prove P0 → Prove P1 → Prove P2 ...

The GPU processes partition N while workers synthesize partition N+1..N+k. With partition_workers=10, all 10 partitions synthesize concurrently and the GPU is continuously fed.
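
A minimal sketch of this stage, using std-only placeholders (synthesize_partition / gpu_prove_partition stand in for the real bellperson and supraseal calls, and scoped threads stand in for the engine's tokio workers); the bounded channel is what supplies the backpressure:

use std::sync::mpsc::sync_channel;
use std::thread;

fn synthesize_partition(idx: usize) -> Vec<u8> {
    vec![idx as u8] // placeholder for a ~13.6 GiB ProvingAssignment
}

fn gpu_prove_partition(assignment: &[u8]) -> Vec<u8> {
    assignment.to_vec() // placeholder for the CUDA proving call
}

fn main() {
    let partition_workers = 10;
    // Bounded channel: synthesis blocks on send() once the GPU falls behind.
    let (tx, rx) = sync_channel(partition_workers);

    thread::scope(|s| {
        for idx in 0..10 {
            let tx = tx.clone();
            s.spawn(move || {
                let assignment = synthesize_partition(idx);
                tx.send((idx, assignment)).unwrap(); // blocks when the channel is full
            });
        }
        drop(tx); // close the channel once only the workers hold senders
        // Single GPU worker drains partitions as they arrive.
        s.spawn(move || {
            for (idx, assignment) in rx {
                let _proof = gpu_prove_partition(&assignment);
                println!("proved partition {idx}");
            }
        });
    });
}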

2. Dual-Worker GPU Interlock (Phase 8)

Inside generate_groth16_proofs_c(), each partition has ~1.3s of CPU preprocessing (pointer setup, bitmap population) before ~3.3s of CUDA kernels, followed by ~0.7s of CPU epilogue. Phase 8 narrows the C++ mutex to cover only the CUDA kernel region and runs two GPU workers per device:

Worker A: CPU prep ══ CUDA ══ epilogue
Worker B:          CPU prep ────══ CUDA ══ epilogue
GPU:               ████ A ████████ B ████

This achieves 100% GPU utilization — zero idle gaps between partitions in steady state.

3. Split Async GPU API (Phase 12)

Phase 12 splits the monolithic C++ prove call into prove_start() (GPU kernels) + finalize() (b_g2_msm CPU + proof assembly). The GPU worker releases the lock ~1.7s earlier per partition, immediately picking up the next synthesized partition while b_g2_msm runs in a spawned finalizer task:

GPU Worker: ══ CUDA ══  ══ CUDA ══  ══ CUDA ══
Finalizer:        b_g2_msm    b_g2_msm    b_g2_msm
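
A condensed model of the split, with prove_start / finalize as illustrative stand-ins for the generate_groth16_proofs_start_c / finalize_groth16_proof_c FFI described in the commit log below (the real engine spawns tokio finalizer tasks rather than threads):

use std::thread;

struct PendingProof { partition: usize }

fn prove_start(partition: usize) -> PendingProof {
    // GPU kernels run here, under the per-device mutex.
    PendingProof { partition }
}

fn finalize(p: PendingProof) -> Vec<u8> {
    // b_g2_msm on CPU + proof assembly, outside the GPU lock.
    vec![p.partition as u8]
}

fn main() {
    let mut finalizers = Vec::new();
    for partition in 0..10 {
        let pending = prove_start(partition); // holds the GPU lock
        // The GPU worker is free ~1.7s earlier; finalization runs concurrently.
        finalizers.push(thread::spawn(move || finalize(pending)));
    }
    for f in finalizers {
        let _proof = f.join().unwrap();
    }
}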

Memory Management

SRS Residency

The daemon pre-populates GROTH_PARAM_MEMORY_CACHE at startup. Since the process is long-lived, the ~47 GiB SRS stays pinned in CUDA host memory across all proofs. No per-proof disk I/O.

Per-Partition Working Set

Memory is proportional to partition_workers, not total partitions:

Pipeline stage      Per-partition memory   Notes
During synthesis    ~16 GiB                12 GiB a/b/c + 4 GiB aux
After prove_start   ~4 GiB                 a/b/c freed immediately, only aux + density remain
Pending finalize    ~4 GiB                 Held by finalizer task

The formula: Peak RSS ≈ 69 + (partition_workers × 20) GiB (see the worked check after the list below)
Validated configurations:
  • 128 GiB system: pw=2, gw=1 → 110 GiB peak, 152s/proof
  • 256 GiB system: pw=7, gw=1 → 208 GiB peak, 53s/proof
  • 512 GiB system: pw=12, gw=2 → 400 GiB peak, 37.7s/proof
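
A worked check of the formula against the validated numbers (the 512 GiB row measures well above the formula's ~309 GiB estimate, so treat the formula as a sizing floor rather than an exact predictor; max_pw below is a hypothetical helper):

fn peak_rss_gib(partition_workers: u64) -> u64 {
    69 + partition_workers * 20
}

fn main() {
    assert_eq!(peak_rss_gib(2), 109); // measured: 110 GiB on the 128 GiB system
    assert_eq!(peak_rss_gib(7), 209); // measured: 208 GiB on the 256 GiB system
    // Largest pw fitting under a RAM ceiling, keeping a safety margin:
    let max_pw = |ram_gib: u64, margin_gib: u64| (ram_gib - margin_gib - 69) / 20;
    assert_eq!(max_pw(128, 10), 2);
}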

Backpressure

Three mechanisms prevent OOM at high concurrency:

  1. Early a/b/c free: prove_start() clears 12 GiB/partition immediately after GPU upload
  2. Channel capacity auto-scaling: bounded to max(synthesis_lookahead, partition_workers)
  3. Partition semaphore held through send: limits total in-flight synthesis outputs
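
A sketch of mechanisms 2 and 3 under tokio, with placeholder names: the permit is acquired before synthesis and dropped only after send() completes, so the semaphore bounds synthesizing plus queued outputs:

use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

async fn synthesize(idx: usize) -> Vec<u8> {
    vec![idx as u8] // placeholder for per-partition synthesis
}

#[tokio::main]
async fn main() {
    let partition_workers = 4;
    let sem = Arc::new(Semaphore::new(partition_workers));
    // Capacity mirrors mechanism 2: max(synthesis_lookahead, partition_workers).
    let (tx, mut rx) = mpsc::channel(partition_workers);

    for idx in 0..10 {
        let (sem, tx) = (sem.clone(), tx.clone());
        tokio::spawn(async move {
            // Mechanism 3: permit acquired before synthesis starts...
            let permit = sem.acquire_owned().await.unwrap();
            let out = synthesize(idx).await;
            tx.send((idx, out)).await.unwrap();
            drop(permit); // ...and released only after the output is queued
        });
    }
    drop(tx);
    while let Some((idx, _assignment)) = rx.recv().await {
        // GPU worker stand-in: consuming frees channel slots, unblocking senders.
        println!("gpu picks up partition {idx}");
    }
}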

CPU Locking / GPU Mutex

The C++ generate_groth16_proofs_c() originally used a static std::mutex that serialized the entire function. cuzk introduces:

  • Heap-allocated mutex (create_gpu_mutex() / destroy_gpu_mutex() FFI): one per physical GPU, managed by the engine. Passed through FFI as *mut c_void.
  • Narrowed scope: acquired before per-GPU CUDA kernel launch, released after kernels complete but before prep_msm_thread.join() — b_g2_msm and proof assembly run outside the lock.
  • Backward compatible: if gpu_mtx is null, falls back to the function-local static mutex (for non-engine callers).

The dual-worker interlock (2 workers per GPU) alternates lock acquisition so Worker B's CPU prep runs while Worker A holds the lock for CUDA kernels, and vice versa.
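
A pure-Rust model of the interlock (sleep durations scaled down 100x; a std Mutex stands in for the heap-allocated C++ mutex): with only the kernel region locked, one worker's prep and epilogue overlap the other's kernels:

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn cpu_prep()     { thread::sleep(Duration::from_millis(13)); } // ~1.3s scaled down
fn cuda_kernels() { thread::sleep(Duration::from_millis(33)); } // ~3.3s scaled down
fn epilogue()     { thread::sleep(Duration::from_millis(7));  } // ~0.7s scaled down

fn main() {
    let gpu = Arc::new(Mutex::new(())); // stands in for the per-device C++ mutex
    thread::scope(|s| {
        for worker in 0..2 {
            let gpu = Arc::clone(&gpu);
            s.spawn(move || {
                for partition in 0..5 {
                    cpu_prep(); // outside the lock: overlaps the other worker's kernels
                    {
                        let _gpu = gpu.lock().unwrap();
                        cuda_kernels(); // the only region serialized per device
                    }
                    epilogue(); // also outside the lock
                    println!("worker {worker} finished partition {partition}");
                }
            });
        }
    });
}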

Task Integration Details

When [Cuzk] Address is set in Curio config:

Behavior        Change
TypeDetails()   GPU requirement zeroed, RAM set to 1 GiB (resource decisions delegated to cuzk)
CanAccept()     Queries GetStatus → rejects if totalPending >= MaxPending
Do()            Generates vanilla proof locally → sends to cuzk via Prove RPC → verifies returned proof locally

When Address is empty (default), all tasks behave exactly as before. No behavioral change for existing deployments.

Build

make cuzk          # builds extern/cuzk → ./cuzk binary (~1m51s from scratch)
make install-cuzk  # installs to /usr/local/bin
make clean         # includes cargo clean in extern/cuzk
make cuzk is NOT in the default BINS or BUILD_DEPS targets, so CI (which has no CUDA) is unaffected. Requires nvcc and cargo.

Files Changed

New files:

  • lib/cuzk/client.go — gRPC client wrapper (connect, Prove, GetStatus, HasCapacity)
  • lib/cuzk/proving.pb.go, proving_grpc.pb.go — generated protobuf/gRPC stubs
  • lib/ffi/cuzk_funcs.go — PoRepSnarkCuzk, ProveUpdateCuzk on SealCalls
  • documentation/en/experimental-features/cuzk-proving-daemon.md — user guide

Modified files:

  • deps/config/types.go — CuzkConfig struct + defaults
  • cmd/curio/tasks/tasks.go — creates cuzk.Client, passes to task constructors
  • tasks/seal/task_porep.go — cuzkClient field, Do/CanAccept/TypeDetails branches
  • tasks/snap/task_prove.go — same pattern
  • tasks/proofshare/task_prove.go — same + threaded through computeProof→computePoRep/computeSnap
  • Makefile — cuzk build/install/clean targets
  • .gitignore — /cuzk binary

Vendored crate files (for git clone && make cuzk):

  • extern/bellpepper-core/ — 13 files (full crate: Cargo.toml, src/, licenses)
  • extern/supraseal-c2/ — 8 files (Cargo.toml, build.rs, Cargo.lock, tests)

Implement the cuzk proving engine as a Rust workspace in extern/cuzk/
with 5 crates (proto, core, server, daemon, bench) and full gRPC API.

Phase 0 delivers:
- gRPC daemon (TCP + Unix socket) with 8 RPC endpoints
- Real PoRep C2 proving via filecoin-proofs-api + SupraSeal CUDA backend
- SRS parameter residency via GROTH_PARAM_MEMORY_CACHE (lazy populate)
- Priority scheduler with binary heap queue
- Prometheus metrics endpoint
- Bench tool for single proof submission, status, preload, metrics

E2E validated: Two consecutive 32GiB PoRep C2 proofs on RTX 5070 Ti —
116.8s cold (SRS from disk) → 92.8s warm (SRS cached), 20.5% improvement.
Both produced valid 1920-byte Groth16 proofs.
…f fix

Improve the cuzk daemon's debuggability and operational readiness
for Phase 1 multi-GPU work:

Observability:
- Add tracing spans (info_span) with job_id correlation throughout
  prover and engine; upstream filecoin-proofs logs now tagged per-job
- Split timing into deserialize vs proving (monolithic in Phase 0)
- Per proof-kind Prometheus counters and duration summaries
- GPU detection via nvidia-smi in GetStatus RPC (name, VRAM)
- Running job info shown in status and annotated on GPU

Correctness:
- Fix AwaitProof to register late listeners (was broken, always 404)
- Graceful shutdown via watch channel (drain, finish current proof)
- Per-kind completed/failed counters with ring buffer for durations

Tooling:
- Add 'batch' command to cuzk-bench (sequential + concurrent modes,
  throughput stats with avg/min/max/proofs-per-min)
- Refactor bench client connection into shared connect() helper
- Add cuzk.example.toml with documented configuration

E2E validated: 32GiB PoRep C2 proof completes in ~110s with full
job_id-correlated logging and per-kind metrics.
…heduling

Wire up WinningPoSt, WindowPoSt, and SnapDeals provers via filecoin-proofs-api:
- prove_winning_post: generate_winning_post_with_vanilla
- prove_window_post: generate_single_window_post_with_vanilla (per-partition)
- prove_snap_deals: generate_empty_sector_update_proof_with_vanilla

Multi-GPU worker pool:
- Auto-detect GPUs via nvidia-smi or use config gpus.devices list
- Spawn one async worker loop per GPU with CUDA_VISIBLE_DEVICES isolation
- Per-worker SRS affinity tracking (last_circuit_id for future routing)

Proto/API updates:
- Add repeated bytes vanilla_proofs field for PoSt/SnapDeals multi-proof inputs
- Rename SnapDeals fields to comm_r_old/comm_r_new/comm_d_new (raw 32-byte)
- Registered proof type enum conversion (FFI V1_1 ↔ proofs-api V1_2 mapping)

Bench tool updated:
- Supports all proof types with --vanilla (JSON array of base64 proofs)
- New flags: --registered-proof, --randomness, --comm-r-old/new, --comm-d-new

8 unit tests pass, 0 warnings, clean cargo check --no-default-features.
…napDeals

Add gen-vanilla subcommand to cuzk-bench for generating vanilla proof test
data from existing sealed sector data. This completes Phase 1 by enabling
end-to-end testing of all four proof types (PoRep C2 plus WinningPoSt,
WindowPoSt, SnapDeals) without requiring Go/Curio.

Three sub-subcommands:
- winning-post: challenge selection + Merkle inclusion proofs (66 challenges)
- window-post: fallback challenges + vanilla proofs (10 challenges)
- snap-prove: partition proofs from original + updated sector data (16 partitions)

Key implementation details:
- filecoin-proofs-api added as optional dep behind 'gen-vanilla' feature flag
- CID commitment parsing via cid crate (bagboea4b5abc... → [u8;32])
- commdr.txt file format parsing (d:<CID> r:<CID>)
- Output format: JSON array of base64 strings (matches Go json.Marshal([][]byte))
- CPU-only, no GPU required (--no-default-features --features gen-vanilla)

Validated against /data/32gbench/ golden data:
- WinningPoSt: 164KB vanilla proof, 218KB JSON output
- WindowPoSt: 25KB vanilla proof, 33KB JSON output
- SnapDeals: 16 × 562KB partition proofs, 12MB JSON output

5 new unit tests (CID parsing, commdr format, JSON round-trip).
Fork bellperson 0.26.0 into extern/bellperson/ with minimal changes to
expose the synthesis/GPU split point for pipelined proving:

bellperson changes (3 files, ~130 lines changed):
- prover/mod.rs: Make ProvingAssignment struct and all fields pub
- prover/supraseal.rs: Make synthesize_circuits_batch() pub, add new
  prove_from_assignments() function (extracted GPU-phase code)
- groth16/mod.rs: Re-export ProvingAssignment, synthesize_circuits_batch,
  prove_from_assignments under cuda-supraseal feature

The internal two-phase architecture was already clean — synthesis runs
circuit.synthesize() on CPU (rayon parallel), producing ProvingAssignment
with a/b/c evaluation vectors + density trackers. GPU phase packs these
into raw pointer arrays and calls supraseal_c2::generate_groth16_proof().
We simply expose both phases as separate public functions.

cuzk workspace changes:
- Cargo.toml: Add [patch.crates-io] for bellperson fork, add bellperson
  as workspace dependency
- Cargo.lock: Updated to use local bellperson

Also includes cuzk-phase2-design.md with complete Phase 2 design:
- Per-partition pipeline strategy (13.6 GiB intermediate state instead of
  136 GiB for all 10 partitions)
- Memory budget analysis for 128 GiB vs 256 GiB machines
- SRS manager design using SuprasealParameters directly
- 7-step implementation plan
- Call chain comparison (Phase 1 monolithic vs Phase 2 pipelined)

All 8 existing cuzk tests pass. Zero new warnings from our changes.
Implement the core Phase 2 infrastructure: split monolithic seal_commit_phase2()
into separate CPU synthesis and GPU proving phases, connected via a pipeline.

New modules:
- srs_manager.rs: Direct SRS loading via SuprasealParameters (bypasses
  GROTH_PARAM_MEMORY_CACHE). CircuitId enum maps proof types to exact
  .params filenames. Supports preload, evict, memory budget tracking.

- pipeline.rs: Per-partition pipelined PoRep C2 proving. Each of the 10
  partitions is synthesized individually (~13.6 GiB intermediate state vs
  ~136 GiB for all 10 at once), then proven on GPU via bellperson's split
  API (synthesize_circuits_batch → prove_from_assignments).
  Enables PoRep pipelining on 128 GiB machines.

Engine changes:
- Engine now supports pipeline.enabled config flag
- When enabled, PoRep C2 jobs use pipelined prover with SrsManager
- When disabled, falls back to Phase 1 monolithic prover
- SRS preloading uses SrsManager in pipeline mode

Config additions:
- [pipeline] section: enabled, synthesis_lookahead
- synthesis_lookahead controls backpressure (partitions buffered)

Dependencies:
- Added direct deps on filecoin-proofs, storage-proofs-{core,porep,post,update},
  bellperson (fork), blstrs, ff, rayon, rand_core, filecoin-hashers
- Correct feature flag propagation (cuda-supraseal for core+bellperson,
  cuda for porep/post/update which lack cuda-supraseal)

Tests: 15 pass (12 existing + 3 new), 0 warnings from cuzk code.
Compiles with --no-default-features (no GPU required for check builds).
Rewrite pipeline.rs to use batch synthesis (all 10 PoRep partitions in
one rayon-parallel call + single GPU pass) instead of per-partition
sequential mode. This matches monolithic performance (~91s vs ~93s)
while enabling cross-proof overlap in the next step.

Add pipelined synthesis/prove functions for all 4 proof types:
- PoRep C2: batch mode (synthesize_porep_c2_batch + gpu_prove)
- WinningPoSt: inlined circuit construction (no private API needed)
- WindowPoSt: single-partition inlined circuit construction
- SnapDeals: all-partition circuit construction

Other changes:
- engine.rs: route all proof types through pipeline when enabled
- prover.rs: make 4 helper functions pub for pipeline.rs use
- Add bincode dep for PoSt/SnapDeals vanilla proof deserialization
Restructure the engine to use a two-stage pipeline architecture when
pipeline mode is enabled:

  Stage 1 (synthesis task): Pulls requests from the scheduler, runs
  CPU-bound circuit synthesis on a blocking thread, pushes the
  SynthesizedJob (intermediate state + SRS ref) to a bounded channel.

  Stage 2 (GPU workers): One per GPU, pull SynthesizedJob from the
  shared channel, run gpu_prove on a blocking thread pinned to their
  GPU via CUDA_VISIBLE_DEVICES, complete the job.

The bounded channel (capacity = synthesis_lookahead config, default 1)
provides backpressure: when GPU workers are busy and the channel is
full, the synthesis task blocks — preventing OOM from unbounded
pre-synthesized proofs.

For PoRep 32G under continuous load, this enables:
  synth(N) | GPU(N) + synth(N+1) | GPU(N+1) + synth(N+2) | ...
  Steady-state: ~55s/proof (synthesis-bound) vs ~91s sequential

When pipeline.enabled = false, falls back to Phase 1 monolithic
workers (no overlap, full cycle per GPU worker).

Also updates the example config with improved pipeline documentation.
Add batch collector and multi-sector synthesis to the pipeline engine.
When max_batch_size > 1, same-type PoRep requests are accumulated and
processed as a single combined synthesis + GPU proving pass, amortizing
fixed GPU costs and improving SM utilization.

New files:
- batch_collector.rs: Accumulates same-circuit-type proof requests,
  flushes on max_batch_size or max_batch_wait_ms timeout. PoRep and
  SnapDeals are batchable; PoSt types bypass the collector entirely.

Pipeline changes:
- synthesize_porep_c2_multi(): Takes N sectors' C1 outputs, builds all
  N×10 partition circuits, synthesizes in one batch call. Returns
  combined SynthesizedProof + sector_boundaries for splitting results.
- split_batched_proofs(): Splits concatenated GPU output back into
  per-sector proof byte vectors using sector_boundaries.

Engine changes:
- Synthesis task now uses BatchCollector for batchable proof types.
  Races scheduler delivery against batch timeout. Non-batchable types
  (WinningPost, WindowPost) preempt-flush any pending batch and process
  immediately.
- SynthesizedJob extended with batch_requests and sector_boundaries.
- GPU worker handles batched results: splits proof output, notifies
  each sector's individual caller with its own proof bytes and timings.

Config:
- scheduler.max_batch_size controls batch limit (1=disabled, 2-3 typical)
- scheduler.max_batch_wait_ms controls accumulation window

Backward compatible: max_batch_size=1 (default) preserves Phase 2
single-sector behavior exactly. All 25 tests pass, 0 cuzk warnings.
…oughput

All Phase 3 E2E tests pass on RTX 5070 Ti:
- Timeout flush: BatchCollector correctly flushes after 30s wait
- Batch=2: 2 sectors synthesized as 20 circuits in 55s (same as 10),
  GPU 69s, yielding 62.7s/proof (1.42x vs baseline 89s)
- Overflow: 3 proofs with batch=2 shows correct batch+overflow+pipeline
- Non-batchable: WinningPoSt bypasses BatchCollector (0.8s total)

Memory: batch=2 peaks at 360 GiB (vs 203 GiB for single proof).
Updated roadmap table with measured numbers.
Synthesis optimizations (55.4s → 50.9s, -8.3%):
- Boolean::add_to_lc/sub_from_lc: eliminate temporary LC allocations in
  circuit gadget hot paths (Boolean::lc creates a fresh Vec on every call;
  the new methods append directly to an existing LC)
- Patched: UInt32::addmany, Num::add_bool_with_coeff, Boolean::enforce_equal,
  Boolean::sha256_ch, Boolean::sha256_maj, lookup3_xy,
  lookup3_xy_with_conditional_negation
- Vec recycling pool in ProvingAssignment::enforce for the 6 LC buffers
- Software prefetch in eval_with_trackers and LinearCombination::eval
- perf stat: 91B fewer instructions (-15.3%), 18.6B fewer branches (-26.7%)

GPU async deallocation (36s → 26s bellperson wrapper, -10s):
- Root cause: ~37 GB of C++ vectors (split_vectors, tail_msm_bases) and
  ~130 GB of Rust Vecs (ProvingAssignment a/b/c) freed synchronously in
  destructors after GPU proving, blocking return for ~10s of munmap() calls
- C++ fix: move split_vectors + tail_msm bases into detached std::thread
- Rust fix: spawn thread to drop provers/input_assignments/aux_assignments
- CUDA internal timing unchanged (~26s); overhead was pure deallocation

Also: A4 (parallel B_G2 CPU MSM), D4 (per-MSM window objects),
CUDA timing instrumentation, synth-only microbenchmark tool.

E2E 32 GiB PoRep C2 on RTX 5070 Ti: 88.9s → 77.2s (-13.2%)
Pre-allocate ProvingAssignment Vecs (a, b, c, aux_assignment) to their
final capacity using hints cached from the first synthesis. Eliminates
~27 reallocation cycles per Vec per circuit.

Benchmarked: no measurable impact on 32 GiB PoRep C2 (50.65s with and
without hints). Rust's geometric doubling amortizes well at our scale,
and the ~265 GB of theoretical redundant copies are overlapped with
computation across 10 parallel circuits on 96 cores. Kept as defensive
code for memory-constrained environments.
Replace full circuit synthesis (alloc+enforce) with two-phase approach:
1. WitnessCS: witness-only generation (enforce is no-op)
2. CSR MatVec: pre-compiled sparse matrix × witness vector

New cuzk-pce crate with:
- RecordingCS: captures R1CS structure into CSR format (with tagged
  column encoding to handle interleaved alloc_input/enforce)
- CsrMatrix/PreCompiledCircuit: serializable CSR storage
- spmv_parallel: row-parallel sparse MatVec with rayon
- evaluate_pce: builds witness vector, evaluates A*w, B*w, C*w
- PreComputedDensity: density bitmaps extracted from CSR structure

Pipeline integration:
- synthesize_auto() dispatcher: PCE fast path when cached, old path otherwise
- Static OnceLock caches per circuit type (porep-32g, winning-post, etc.)
- ProvingAssignment::from_pce() constructor in bellperson fork
- All 6 synthesis call sites switched to synthesize_auto()

Benchmark (pce-bench subcommand):
- Correctness: all 10 circuits × 130M constraints match bit-for-bit
- Baseline synthesis: 50.4s (10 circuits, old path)
- PCE synthesis:     35.5s (26.5s witness + 8.8s MatVec)
- Speedup:           1.42x
- PCE extraction:    46.9s (one-time cost, amortized over all future proofs)
- Peak RAM:          375 GB
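
For illustration, a minimal row-parallel CSR SpMV in the spirit of spmv_parallel, assuming rayon and with u64 wrapping arithmetic standing in for blstrs field elements:

use rayon::prelude::*;

struct CsrMatrix {
    row_ptr: Vec<usize>, // len = rows + 1
    col_idx: Vec<usize>,
    values: Vec<u64>,
}

fn spmv_parallel(m: &CsrMatrix, witness: &[u64]) -> Vec<u64> {
    (0..m.row_ptr.len() - 1)
        .into_par_iter() // one rayon task per row
        .map(|row| {
            let (start, end) = (m.row_ptr[row], m.row_ptr[row + 1]);
            (start..end)
                .map(|k| m.values[k].wrapping_mul(witness[m.col_idx[k]]))
                .fold(0u64, u64::wrapping_add)
        })
        .collect()
}

fn main() {
    // 2x3 matrix [[1,0,2],[0,3,0]] times witness [1,1,1] = [3,3]
    let m = CsrMatrix {
        row_ptr: vec![0, 2, 3],
        col_idx: vec![0, 2, 1],
        values: vec![1, 2, 3],
    };
    assert_eq!(spmv_parallel(&m, &[1, 1, 1]), vec![3, 3]);
}
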
Add PcePipeline subcommand to cuzk-bench for testing PCE memory behavior
under sequential and parallel pipelining modes:
- RSS tracking via /proc/self/status at each pipeline stage
- malloc_trim() between proofs for clean memory release
- Wave-based parallel execution using std::thread::scope (-j N flag)
- compare_old flag for A/B comparison in first iteration

Update cuzk-project.md with j=2 parallel pipeline benchmark results:
- 2 concurrent syntheses: 49s wall vs 71s sequential (1.45x wall speedup)
- Per-proof degradation: 46-49s (vs 35.5s j=1) due to BW contention
- Peak RSS: 407 GiB (2x working sets + PCE static + transient)
PCE disk persistence (raw binary format):
- New cuzk-pce::disk module with save_to_disk/load_from_disk
- Raw binary format (v2): 32-byte header + bulk byte dumps of CSR vectors
- 5.4x faster than bincode: 9.2s load vs 49.9s (from tmpfs, 25.7 GiB)
- Atomic writes (tmp + rename) to prevent corruption
- Header with magic/version/dimensions for quick validation
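
A sketch of the save path under those constraints; the header layout and names here are illustrative, not the actual v2 format:

use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

const MAGIC: u32 = 0x4350_4345; // illustrative, not the real magic value
const VERSION: u32 = 2;

fn save_csr(path: &Path, rows: u64, cols: u64, values: &[u8]) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    {
        let mut f = File::create(&tmp)?;
        // 32-byte header: magic, version, dimensions, payload length.
        f.write_all(&MAGIC.to_le_bytes())?;
        f.write_all(&VERSION.to_le_bytes())?;
        f.write_all(&rows.to_le_bytes())?;
        f.write_all(&cols.to_le_bytes())?;
        f.write_all(&(values.len() as u64).to_le_bytes())?;
        // Bulk byte dump, no per-element serialization (hence the win over bincode).
        f.write_all(values)?;
        f.sync_all()?;
    }
    fs::rename(&tmp, path) // atomic on the same filesystem: no torn files
}

fn main() -> io::Result<()> {
    save_csr(Path::new("/tmp/porep-32g.pce"), 4, 4, &[0u8; 64])
}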

Daemon integration:
- preload_pce_from_disk() called at engine startup (loads all PCE files)
- extract_and_cache_pce() now saves to disk after extraction
- Background PCE auto-extraction triggered after first old-path synthesis
- get_pce() made public for engine-level cache checking

Phase 6 design document (c2-optimization-proposal-6.md):
- Slotted partition pipeline: overlap synth/GPU at partition granularity
- slot_size=2 sweet spot: 41s latency (vs 69.5s batch), 54 GiB RAM (vs 136 GiB)
- Steady-state throughput unchanged (35.5s/proof, synthesis-bound)
- Multi-sector and multi-GPU extension paths documented

Measured (RTX 5070 Ti, 32 GiB PoRep):
- PCE save (NVMe): 22.3s, 1.2 GB/s
- PCE load (tmpfs): 9.2s, 3.0 GB/s
- PCE load (NVMe): ~13-15s estimated (3x faster than 47s extraction)
…esis

Redesign the slotted pipeline to truly pipeline partition synthesis with
GPU proving. All 10 partitions are synthesized in parallel (bounded by
channel capacity), and the GPU consumes them one at a time as they
arrive.

Key changes:
- prove_porep_c2_partitioned(): spawns one thread per partition via
  std::thread::scope, all run concurrently. Bounded sync_channel
  provides backpressure to limit live RAM.
- Each partition = 1 GPU call (num_circuits=1), which gives fast
  b_g2_msm (~0.4s multi-threaded vs ~23s for num_circuits>=2).
- ProofAssembler: indexed by partition number, supports out-of-order
  arrival, assembles in partition order.
- synthesize_partition(): single-partition synthesis helper.
- Backward-compatible prove_porep_c2_slotted() wrapper dispatches
  to partitioned path when slot_size < num_partitions.

Benchmark results (32 GiB PoRep, 96-core Zen4, RTX 5070 Ti):
  max_concurrent=1: 72.0s, 71.3 GiB peak (5.42x overlap)
  max_concurrent=2: 72.7s, 86.8 GiB peak (5.38x overlap)
  max_concurrent=3: 71.9s, 86.8 GiB peak (5.37x overlap)
  batch-all:        62.3s, 228.5 GiB peak (no overlap)

Pipelined mode uses 3.2x less RAM (71 vs 228 GiB) with only ~16%
latency overhead. GPU takes ~3.8s/partition vs 25.5s batch-all total.
…ispatcher

Add timeline instrumentation for waterfall visualization of the proving
pipeline. Events (SYNTH_START/END, CHAN_SEND, GPU_PICKUP/START/END) are
emitted as CSV to stderr with millisecond offsets from engine start,
enabling precise analysis of GPU utilization and idle gaps.

Add synthesis_concurrency config parameter that controls how many proofs
can be synthesized simultaneously on the CPU. When synthesis takes longer
than GPU proving (39s vs 27s), the GPU idles ~12s between proofs with
sequential synthesis. With concurrency=2, overlapping syntheses can keep
the GPU continuously fed.

Implementation uses tokio::sync::Semaphore to limit concurrent synthesis
tasks. When concurrency=1 (default), behavior is identical to the old
sequential loop. When >1, each batch is spawned as an independent task
with semaphore-guarded concurrency.

Benchmark results (PoRep C2, 5-proof runs):
  concurrency=1: 45.3s/proof, 70.9% GPU utilization (baseline)
  concurrency=2, j=2: 42.2s/proof, 77.8% GPU utilization (+7%)
  concurrency=2, j=3: 43.1s/proof, 90.7% GPU utilization (+5%)
  concurrency=2, j=4: 60.2s/proof (CPU contention, regression)

CPU contention between synthesis (rayon) and b_g2_msm (rayon) during GPU
proving limits the improvement. Thread pool isolation is the next step.
Add configurable thread pool partitioning to reduce CPU contention when
running parallel synthesis alongside GPU proving.

Two independent thread pools compete for CPU cores during proving:
  1. Rayon global pool — used by synthesis (bellperson, PCE SpMV)
  2. C++ groth16_pool (sppark) — used by b_g2_msm and preprocessing

Changes:
- groth16_cuda.cu: Convert static groth16_pool to lazy initialization
  via std::call_once, reading CUZK_GPU_THREADS env var for pool size.
  This allows the Rust caller to set the env var before first GPU call.
- groth16_srs.cuh: Update all pool references to use get_groth16_pool()
- config.rs: Add gpus.gpu_threads field (default 0 = all CPUs)
- daemon main.rs: Configure rayon global pool from synthesis.threads,
  set CUZK_GPU_THREADS from gpus.gpu_threads before engine start
- Cargo.toml: Add rayon dependency to cuzk-daemon
- cuzk.example.toml: Document thread isolation strategy

Benchmark results (PoRep C2 32G, 96C/192T + RTX 5070 Ti):
  Baseline (sequential, no isolation):      46.1s/proof, 70.9% GPU util
  Parallel c=2, j=2, no isolation:          46.0s/proof, 81.9% GPU util
  Parallel c=2, j=2, rayon=192, gpu=32:     44.9s/proof, 76.9% GPU util
  Parallel c=2, j=3, rayon=192, gpu=32:     42.8s/proof (best, +7.2%)

Thread isolation provides modest improvement (~2-3%). The dominant factor
remains synthesis thread scalability: 2 syntheses sharing the rayon pool
each get ~96 effective threads, inflating synth from 39s to 45-47s.
Higher pipeline fill (j=3) is more effective than thread partitioning.
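
A Rust analog of the lazy env-sized pool (the actual change is C++ std::call_once in groth16_cuda.cu; here a rayon pool plays the role of groth16_pool):

use std::sync::OnceLock;

fn gpu_pool() -> &'static rayon::ThreadPool {
    static POOL: OnceLock<rayon::ThreadPool> = OnceLock::new();
    POOL.get_or_init(|| {
        // Read once, lazily: the daemon sets the env var before the first GPU call.
        let threads = std::env::var("CUZK_GPU_THREADS")
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(0); // 0 = all CPUs (rayon default)
        rayon::ThreadPoolBuilder::new()
            .num_threads(threads)
            .build()
            .expect("gpu pool")
    })
}

fn main() {
    std::env::set_var("CUZK_GPU_THREADS", "32");
    println!("gpu pool threads: {}", gpu_pool().install(rayon::current_num_threads));
}
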
Proposal 7 replaces the thundering-herd synthesis pattern (all 10
partitions start/finish simultaneously) with a synth worker pool that
processes partitions individually and feeds them to the GPU one at a time.

Key design points:
- 20 synth workers (configurable) each synthesize 1 partition (~29s)
- Workers submit to engine GPU channel; block if full (backpressure)
- GPU proves each partition with num_circuits=1 (b_g2_msm: 0.4s vs 25s)
- ProofAssembler in JobTracker accumulates partitions per job_id
- Cross-sector overlap: next sector's synth starts on free workers

Expected impact: 42.8s/proof → ~30s/proof steady-state (GPU-limited),
~100% GPU utilization, zero inter-sector GPU idle time.

~110 net new lines of code, primarily in engine.rs.
Implement the Phase 7 architecture from c2-optimization-proposal-7.md:
dispatches individual PoRep partitions as independent work units through
the engine's synthesis→GPU pipeline, eliminating the thundering-herd
pattern and enabling cross-sector pipelining.

Key changes:
- SynthesizedJob: add partition_index, total_partitions, parent_job_id
  fields for per-partition routing
- PartitionedJobState: new struct tracking per-job ProofAssembler,
  accumulated timings, and failure state
- PartitionWorkItem: work unit for spawn_blocking synthesis workers
- JobTracker: add assemblers map for in-progress partitioned proofs
- process_batch(): new Phase 7 dispatch path when partition_workers > 0
  and single-sector PoRep C2 — parses C1 once, registers assembler,
  dispatches 10 spawn_blocking tasks gated by partition_semaphore,
  returns immediately (non-blocking)
- GPU worker: partition-aware result routing — routes partition proofs
  to ProofAssembler, delivers final proof when all partitions complete,
  calls malloc_trim(0) after each partition to release memory
- Error handling: failed flag on PartitionedJobState, synthesis/GPU
  failure propagation, skip work for already-failed jobs
- Config: add synthesis.partition_workers (default 20), partition
  semaphore limiting concurrent synthesis workers
- Phase 6 slotted pipeline retained as fallback (partition_workers=0,
  slot_size>0)
- ParsedC1Output and parse_c1_output made pub for engine access
- synthesize_partition made pub for engine dispatch

Expected steady-state: 42.8s/proof → ~30s/proof (GPU-limited), ~100%
GPU utilization, zero cross-sector GPU idle gaps. Per-partition GPU
calls use num_circuits=1, making b_g2_msm 0.4s instead of 25s.
Proposal to eliminate per-partition GPU idle gaps by overlapping one
worker's CPU preamble/epilogue with another worker's CUDA kernel
execution. Two GPU workers per physical GPU share a fine-grained
mutex that brackets only the CUDA kernel region inside
generate_groth16_proofs_c.

Key findings:
- The static mutex in groth16_cuda.cu covers the entire function
  (~3.5s), but actual CUDA kernel time is ~2.1s. The remaining
  ~1.3s is CPU work (preprocessing, b_g2_msm, epilogue) that
  could overlap with the next partition's GPU execution.
- The sppark semaphore_t is a counting semaphore that latches
  notify() before wait(), confirming safe barrier semantics for
  the proposed restructuring.
- Recommended approach: pass mutex pointer from Rust through FFI,
  acquire before per-GPU thread launch, release after per-GPU
  thread join, leaving b_g2_msm and epilogue outside the lock.

Estimated impact: GPU efficiency ~64% → ~98%, throughput ~3-10%
improvement on top of Phase 7.
Narrow the C++ static mutex in generate_groth16_proofs_c to cover only
the CUDA kernel region (NTT+MSM, batch additions, tail MSMs). CPU
preprocessing and b_g2_msm now run outside the lock, allowing two GPU
workers to interleave: one does CPU work while the other runs CUDA.

Changes across 7 files (~195 lines):

- groth16_cuda.cu: Remove static mutex, add std::mutex* parameter,
  acquire lock before per-GPU thread launch, release after per-GPU
  join (before prep_msm_thread join). Add create/destroy_gpu_mutex
  C helpers for FFI allocation.

- supraseal-c2/lib.rs: Add gpu_mtx parameter to FFI decl and both
  generate_groth16_proof wrappers. Export alloc/free_gpu_mutex.

- bellperson supraseal.rs: Add GpuMutexPtr type, SendableGpuMutex
  wrapper, alloc/free helpers. Thread gpu_mutex through
  prove_from_assignments. Legacy callers pass null (fallback mutex).

- pipeline.rs: Thread GpuMutexPtr through gpu_prove(). Internal
  callers pass null_mut() for backward compatibility.

- engine.rs: Create one C++ mutex per GPU via alloc_gpu_mutex().
  Spawn gpu_workers_per_device workers per GPU (default 2), each
  sharing the same mutex address (as usize for Send safety).

- config.rs: Add gpus.gpu_workers_per_device (default 2).

Benchmark results (RTX 5070 Ti, 96-core Zen4, partition_workers=20):

  Single proof:  69.3s wall (GPU efficiency: 100.0% — zero idle gaps)
  Throughput c=5 j=3: 44.0s/proof (Phase 7: 50.7s → 13.2% improvement)
  Throughput c=5 j=2: 49.5s/proof (Phase 7: 59.8s → 17.2% improvement)

  partition_workers=30 regresses to 60.4s/proof due to CPU contention
  from 30 simultaneous synthesis workers starving GPU preprocessing.
Document three new phases of the pipelined SNARK proving engine:

- Phase 6: Pipelined partition proving (slot-based, 62x b_g2_msm speedup)
- Phase 7: Engine-level per-partition pipeline (cross-sector overlap)
- Phase 8: Dual-worker GPU interlock (100% GPU utilization)

Key benchmark findings:
- Optimal partition_workers=10-12 on 96-core machine (43.5s/proof → 37.4s)
- System is perfectly GPU-bound: throughput = serial CUDA kernel time
  (10 partitions × 3.75s = 37.5s vs measured 37.4s/proof)
- Cross-sector GPU transitions are seamless (<50ms after warmup)
- synthesis_concurrency>1 provides no benefit (synthesis already overlapped)

Update file references and related documents for Phases 6-8.
Two changes to reduce GPU SM idle time caused by PCIe transfers
inside the GPU mutex:

1. Pre-stage a/b/c polynomials (6 GiB) outside the mutex via
   cudaHostRegister + async upload on a dedicated copy stream.
   Overlaps with the other worker's CUDA kernels.

2. Deferred batch sync in Pippenger MSM: double-buffer host-side
   bucket results so GPU never waits for CPU to process the
   previous batch. Eliminates 8+ per-batch idle gaps per MSM.

Includes full PCIe transfer inventory (23.6 GiB HtoD per partition)
and expected 4-9% throughput improvement over Phase 8.
…uploads

- Pre-stage a/b/c polynomial uploads using cudaHostRegister + async DMA
  before GPU mutex acquisition (host pinning) and after (device alloc + upload)
- Memory-aware allocation: query cudaMemGetInfo after pool trim, only pre-stage
  if full 12 GiB (d_a + d_bc) fits with 512 MiB safety margin
- Double-buffered deferred batch sync in Pippenger MSM (sppark submodule):
  per-batch sync deferred to next iteration, overlapping DtoH with compute
- Early d_bc free inside per_gpu thread after NTT phase completes
- GPU resources cleaned up before mutex release, host pages unregistered after

Results (gw=1, pw=10, c=3, j=1):
- 32.1s/proof avg (14.2% improvement over Phase 8 baseline 37.4s)
- ntt_msm_h_ms: 2430ms -> 690ms (-71.6%)
- gpu_total_ms: 3746ms -> 1450ms (-61.3%)

gw=2 shows regression (41.0s) due to cudaDeviceSynchronize + pool trim
serialization — needs further investigation.
Add per-stage timing to prestage setup: sync_ms, trim_ms, alloc_ms, upload_ms.

Key findings with c=15 j=15 gw=1:
- Pre-staging overhead: 18ms avg (negligible - PCIe gen5 is fast)
- GPU kernels: 1824ms avg/partition
- CPU critical path (prep_msm + b_g2_msm): 2393ms avg/partition
- CPU is the bottleneck, not GPU — DDR5 bandwidth wall
  with 10 concurrent synthesis workers competing for memory
- Throughput: 41.3s/proof (steady-state)
- c=30 j=20 causes OOM/crash from memory pressure
Phase 9 cuts GPU kernel time 51% (3.7s→1.8s/partition) but steady-state
throughput only improves 14% (37.4→32.1s in isolation) because CPU
preprocessing (prep_msm + b_g2_msm = 2.4s/partition) is now the critical
path. At high concurrency, 10 synthesis workers saturate 8-channel DDR5
bandwidth, slowing CPU MSM operations 12-27% and limiting throughput to
~41s/proof.
Phase 10 (two-lock GPU interlock) was implemented, tested, and abandoned:
- 16 GB VRAM too small for 2 workers' pre-staged buffers
- CUDA memory APIs are device-global, serializing across streams
- Phase 9 already hides b_g2_msm behind GPU lock release

Phase 11 design spec identifies 3 sources of throughput degradation
(32.1s isolation → 38.0s at c=20 j=15) and proposes 3 interventions:
1. Serialize async_dealloc to bound TLB shootdown storms
2. Reduce groth16_pool to 32 threads to cut L3 thrashing
3. Memory-bandwidth throttle during b_g2_msm via shared atomic

Also reverts groth16_cuda.cu Phase 10 timing instrumentation back to
Phase 9 state.
Three interventions to reduce CPU memory subsystem contention at high
concurrency (c=20 j=15):

1. Serialize async_dealloc threads (static mutex in C++ and Rust) to
   prevent concurrent munmap() TLB shootdown storms. Alone: negligible.

2. Reduce groth16_pool from 192 to 32 threads (gpu_threads=32 config).
   Cuts b_g2_msm L3 cache footprint from ~1.1 GiB to ~192 MB. b_g2_msm
   slows from 0.5s to 1.7s but runs outside GPU lock. Best result:
   36.7s/proof (3.4% improvement over Phase 9 baseline of 38.0s).

3. Memory-bandwidth throttle: global AtomicI32 flag set by C++ around
   b_g2_msm, checked by Rust SpMV every 64 chunks with yield_now().
   No additional gain over Intervention 2 alone.

Also tested gw=3 (37.2s) and gw=4 (37.4s) — both worse due to CPU
contention from additional GPU workers.

Optimal config: gw=2, pw=10, gpu_threads=32 → 36.7s/proof.
Decouple b_g2_msm CPU computation from the GPU worker loop so the GPU
worker can pick up the next synthesized partition ~1.7s faster. The C++
generate_groth16_proofs_c is refactored into start (returns pending
handle after GPU lock release) + finalize (joins b_g2_msm, runs
epilogue). GPU workers spawn a separate tokio finalizer task and
immediately loop back for the next job.

Key changes:
- C++ groth16_pending_proof struct holds all shared state on the heap
- generate_groth16_proofs_start_c / finalize_groth16_proof_c split API
- Fix use-after-free: prep_msm_thread now reads provers_owned (heap
  copy) instead of the stack parameter that goes out of scope
- Rust FFI: start_groth16_proof, finish_groth16_proof, drop_pending_proof
- Bellperson: PendingProofHandle<E>, prove_start(), finish_pending_proof()
- Pipeline: gpu_prove_start() / gpu_prove_finish(), PendingGpuProof alias
- Engine: GPU worker restructured with spawned finalizer task; extracted
  process_partition_result() and process_monolithic_result() helpers
- SynthesisCapacityHint struct added (was referenced but undefined)
- Removed unused PR generic from start_groth16_proof FFI

Benchmark (gw=2 pw=10 gt=32, c=20 j=15): 37.1s/proof throughput
(vs 38.0s Phase 11 baseline, ~2.4% improvement).
@magik6k
Collaborator Author

magik6k commented Feb 27, 2026

Do you plan to do more benchmarks on different hardware?

Definitely, but I already benchmarked under some pipeline constraints, which should be fairly representative.

Generally, CuZK is shipped here as an entirely separate daemon which you connect to a Curio node as a 100% optional add-on, so this integration is as safe as it gets.

I'll run this on more diverse HW once we get this PR in, but I expect it to be universally faster and cheaper:

  • This uses less average RAM per proof execution
  • It lowers minimum RAM required below even pre-supraseal-c2 requirements
  • It optimizes some operations to the point that they are guaranteed to always be faster
  • Heavier pipelining guarantees that bigger hardware is much better utilized
  • Synth pipelines have memory-bandwidth-optimizing semaphores that ensure really memory-bandwidth-expensive compute doesn't overlap

I'd be very surprised if there is any machine configuration that is worse with this.

Eventually we might figure out a good way to pull this daemon into the main Curio binary, but initially it seems wise to gain confidence in a maximally sandboxed way.

FWIW, I just verified valid PoSt and PoRep proofs on Calibnet, so that's already something.

@magik6k magik6k marked this pull request as ready for review February 27, 2026 23:08
@magik6k magik6k requested a review from a team as a code owner February 27, 2026 23:08
@filecoin-project filecoin-project deleted a comment from cursor bot Feb 27, 2026
magik6k added 5 commits March 1, 2026 14:27
Thread a gpu_index parameter through the entire proving stack
(C++ -> supraseal-c2 -> bellperson -> pipeline -> engine) so that
single-circuit partition proofs run on the GPU assigned to the Rust
worker instead of always landing on GPU 0.

Previously, the C++ code computed n_gpus = min(ngpus(), num_circuits),
which for single-circuit proofs always resolved to GPU 0 via
select_gpu(0). This made per-GPU mutexes ineffective on multi-GPU
systems: workers assigned to different GPUs could run CUDA kernels
simultaneously on GPU 0, causing proof corruption (the original
shared-mutex workaround serialized everything to one GPU).

Now gpu_index >= 0 pins work to that specific GPU, while -1 preserves
the original multi-GPU fan-out for batched proofs. Also converts the
global d_a_cache singleton to a per-GPU array to avoid thrashing when
workers on different GPUs run concurrently.
Collaborator Author


Should clean those up; potentially we do want those around somewhere in code docs as knowledge bases / additional context, mostly for Agents to use?

@snadrus
Contributor

snadrus commented Mar 7, 2026 via email

magik6k added 15 commits March 13, 2026 11:45
…eanup

Fix three proofshare provider bugs:

1. CreateWorkAsk deadlock: When the service returns HTTP 429 (TooManyRequests),
   CreateWorkAsk retried forever, blocking the Do() loop from polling for work
   matched to existing asks. This created a permanent deadlock where no work
   could be inserted into proofshare_queue. Now returns ErrTooManyRequests
   immediately; the caller applies exponential backoff (up to 2min) on
   no-progress iterations and resets to 3s when progress is made.

2. cuzk job_id collision: PSProve PoRep RequestId was fmt.Sprintf("ps-porep-%d-%d",
   miner, sector) which is identical for all concurrent proofshare challenges
   (all target miner=1000, sector=1). This caused the cuzk engine's partition
   assembler to mix results from different proofs, producing 0/10 valid
   partitions. Now includes taskID for uniqueness.

3. Queue maintenance: proofshare_queue rows with submit_done=TRUE were never
   deleted, causing unbounded table growth and expensive dedup SELECTs.
   Completed rows older than 2 days are now purged every 5 minutes.
   Orphaned compute tasks are reset (UPDATE SET NULL) instead of deleted
   so work can be re-assigned. Dedup SELECT scoped to submit_done=FALSE.
Replace the static partition_workers semaphore with a unified MemoryBudget
that tracks all major memory consumers (SRS pinned, PCE heap, synthesis
working set) under a single byte-level budget auto-detected from system RAM.

- Add MemoryBudget and MemoryReservation with RAII partial-release support
- Add PceCache (replaces static OnceLock PCE storage) with LRU eviction
- Make SrsManager budget-aware with on-demand loading and eviction
- Two-phase working memory release: a/b/c freed after prove_start,
  rest after prove_finish
- Remove partition_workers, srs.preload, pinned_budget config fields
- Add total_budget (auto/explicit), safety_margin, eviction_min_idle config
- Backward-compatible config parsing (old fields ignored with warnings)
- All 15 unit tests pass, pce-bench validation passes on real 32G data
The evictor callback runs from async budget.acquire(), so calling
blocking_lock() on the tokio Mutex panics with 'Cannot block the
current thread from within a runtime'. Switch to try_lock() and
skip SRS eviction candidates when the mutex is held — the acquire
loop retries so they'll be caught on the next iteration.
Adds a StatusTracker that records pipeline, GPU worker, and memory state
as proof jobs flow through the engine. A minimal raw-TCP HTTP/1.1 server
on a configurable port (daemon.status_listen) serves GET /status returning
JSON snapshots at 500ms polling granularity.

Tracks per-partition lifecycle (synthesizing/synth_done/gpu/done/failed),
GPU worker busy/idle state, memory budget usage, SRS/PCE allocations,
buffer flight counters, and aggregate completion stats. Completed jobs
are garbage-collected after 30 seconds.
Adds a real-time cuzk engine status visualization to the vast-manager UI.

Backend: /api/cuzk-status/{uuid} endpoint that SSH-tunnels to the remote
instance (via ControlMaster for connection reuse) and proxies the cuzk
/status JSON response.

Frontend: 1.5s-interval polling panel showing memory budget gauge,
synthesis concurrency, completion counters, per-partition pipeline
waterfall with state-colored cells (pending/synthesizing/synth_done/
gpu/done/failed) and timing, GPU worker cards, SRS/PCE allocation
list, and buffer flight counters. Cached last response avoids flash
on dashboard refresh cycles.
…ation

status.rs: partition_gpu_end now only clears a worker's busy state if
the worker is still assigned to the same job+partition. With split GPU
proving, the finalizer task may complete after the worker already picked
up a new job — the unconditional clear was clobbering the new state,
making workers appear permanently idle.

ui.html: show 16 chars of job_id instead of 8 to avoid cutting off
at the prefix boundary (e.g. "ps-snap-" was meaningless).
Replace per-partition tokio::spawn with a shared mpsc channel and
synthesis worker pool that processes partitions in FIFO arrival order.
This ensures earlier jobs' partitions are synthesized and GPU-proved
before later jobs, preventing all pipelines from stalling together.

The synthesis worker pool is unified — handles both PoRep and SnapDeals
via ParsedProofInput match, eliminating ~300 lines of duplicated
synthesis/error-handling logic.

Also compute synthesis.max_concurrent dynamically from the memory budget
(total_bytes / smallest_partition_size) instead of using the static
synthesis_concurrency config value, which was misleading (showed 4
when actual budget-gated concurrency could be 44).

Tested: 2 concurrent PoRep C2 proofs completed (0.485 proofs/min),
followed by live SnapDeals processing with correct budget gating.
Replace FIFO mpsc channels with BTreeMap-based priority queues for
both the synthesis work queue and the GPU proving queue. Items are
keyed on (job_seq, partition_idx) where job_seq is a monotonically
increasing counter assigned at pipeline dispatch time.

This ensures both synthesis workers and GPU workers always pick the
lowest partition in the oldest pipeline, completing jobs sequentially
rather than interleaving partitions randomly across pipelines.

Before: all partitions from concurrent jobs raced on Notify-based
budget acquire, causing random GPU assignment (e.g., Job A P0 and
Job B P5 on GPU simultaneously). Result: all pipelines stalled
together at 0.485 proofs/min.

After: Job A completes fully before Job B starts GPU work. Measured
0.602 proofs/min (24% improvement) — Job A finishes in 114s without
contention, vs ~245s when interleaved.
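
The ordering mechanism in miniature: a BTreeMap keyed on (job_seq, partition_idx) always hands out the lowest partition of the oldest job first:

use std::collections::BTreeMap;

fn main() {
    let mut queue: BTreeMap<(u64, u32), &str> = BTreeMap::new();
    // Two interleaved jobs arrive out of order.
    queue.insert((2, 0), "job B / partition 0");
    queue.insert((1, 5), "job A / partition 5");
    queue.insert((1, 0), "job A / partition 0");

    // pop_first() yields keys in ascending order: oldest job, lowest partition.
    while let Some(((job_seq, part), item)) = queue.pop_first() {
        println!("dispatch job_seq={job_seq} partition={part}: {item}");
    }
    // Order: A/0, A/5, B/0 — job A completes before job B starts GPU work.
}
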
Split the synthesis worker pool into two stages:
1. A single dispatcher task that serializes pop + budget acquire
2. Worker pool that receives (item, reservation) via bounded channel

Previously, N workers each popped an item from the priority queue
then raced for budget — budget went to whichever worker's acquire()
completed first, causing out-of-order synthesis (e.g., P4 before P0).

Now the single dispatcher ensures budget is allocated in strict
(job_seq, partition_idx) order: lowest partition in oldest pipeline
always gets budget first. The bounded channel preserves this order
into the worker pool.
Add support for CUDA-pinned memory backing in ProvingAssignment to
enable fast H2D transfers at PCIe line rate (~50 GB/s) instead of
going through CUDA's internal bounce buffer (~1-4 GB/s).

- PinnedBacking struct tracks raw pinned ptrs and pool return callback
- PinnedReturnFn type for returning buffers to a pool on release
- new_with_pinned() constructor creates a/b/c via Vec::from_raw_parts
- release_abc() mem::forgets pinned Vecs and calls return callback
- Drop impl ensures pinned buffers are returned even without explicit release
- prove_start uses release_abc() instead of manual Vec replacement
- synthesize_circuits_batch_with_prover_factory() accepts closure for
  custom prover creation, enabling callers to inject pinned-backed provers
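
A compacted model of the PinnedBacking contract, with an ordinary heap allocation standing in for cudaHostAlloc memory (names follow the commit; details are illustrative):

use std::mem;

type PinnedReturnFn = Box<dyn FnOnce(*mut u64, usize)>;

struct PinnedBacking {
    ptr: *mut u64,
    cap: usize,
    ret: Option<PinnedReturnFn>, // pool return callback
}

fn main() {
    // Stand-in "pinned" buffer; the real pool would own a cudaHostAlloc region.
    let buf: Box<[u64]> = vec![0u64; 1024].into_boxed_slice();
    let cap = buf.len();
    let ptr = Box::into_raw(buf) as *mut u64;

    let mut backing = PinnedBacking {
        ptr,
        cap,
        ret: Some(Box::new(|p, c| unsafe {
            // Hand the buffer back (here: just free the stand-in allocation).
            drop(Box::from_raw(std::ptr::slice_from_raw_parts_mut(p, c)));
        })),
    };

    // new_with_pinned(): a Vec view over externally owned memory (len grows
    // during synthesis; capacity is fixed by the pinned buffer).
    let a: Vec<u64> = unsafe { Vec::from_raw_parts(backing.ptr, 0, backing.cap) };

    // release_abc(): never let Vec's allocator free pinned memory; forget the
    // Vec and return the raw buffer to the pool instead.
    mem::forget(a);
    (backing.ret.take().unwrap())(backing.ptr, backing.cap);
}
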
Fix GPU underutilization caused by slow H2D PCIe transfers from unpinned
host memory. CUDA's cudaMemcpyAsync from unpinned memory goes through a
small internal bounce buffer at ~1-4 GB/s instead of PCIe Gen5 line rate
(~50 GB/s), causing 2-14s NTT stalls per partition.

PinnedPool (pinned_pool.rs):
- Pool of CUDA pinned buffers (cudaHostAlloc/cudaFreeHost)
- checkout/checkin with size-aware reuse (returns smallest fitting buffer)
- PinnedAbcBuffers: atomically checks out 3 buffers for a/b/c vectors
- Not budget-integrated: pinned memory replaces heap a/b/c allocations

Reactive dispatch throttle (engine.rs):
- Semaphore-based 1:1 modulation via max_gpu_queue_depth config
- Dispatcher acquires permit before starting each synthesis
- GPU finalizer releases permit after prove_finish completes
- Prevents burst dispatch that caused cudaHostAlloc serialization stalls
  and pinned pool thrashing (474 allocs / 12 reuses -> 24 allocs / 48 reuses)

Pipeline integration (pipeline.rs):
- synthesize_with_hint wires PinnedPool into prover factory closure
- Graceful fallback to unpinned on pool exhaustion with warning log

C++ timing instrumentation (groth16_cuda.cu):
- mutex_wait_ms, barrier_wait_ms, mutex_held_ms for GPU pipeline profiling

Results: NTT+H2D dropped from 2-14s to 0ms per partition, total GPU time
per partition dropped from 8-19s to ~950ms, budget freed for PCE caching.
Replace the burst-based P-controller with a continuous pacer that
regulates synthesis dispatch rate using PI control with GPU rate
feed-forward.

DispatchPacer:
- Feed-forward: EMA of GPU inter-completion interval (measured via
  atomic completion counter incremented by GPU finalizers)
- Feedback: PI correction on (target - EMA_waiting), where EMA_waiting
  is a smoothed gpu_work_queue.len()
- Conservative gains (Kp=0.1, Ki=0.008) for the 20-60s synthesis delay
- Output: dispatch interval clamped to [50ms, 60s]

Bootstrap phase:
- Dispatches target items at 200ms spacing to prime the pipeline
- Then waits for the first GPU completion to calibrate the GPU rate EMA
- Switches to PI control once calibrated

Steady state:
- Dispatches one item per timer tick at the PI-computed interval
- GPU events update pacer state but don't directly trigger dispatch
- Converges to matching GPU consumption rate with target items waiting
- Periodic status logging every 5 dispatches

This replaces the previous burst dispatch which caused:
- Pinned pool exhaustion -> cudaHostAlloc -> GPU driver serialization
- GPU timing jitter -> pacing instability -> more burst dispatch
The steady dispatch rate keeps concurrent synthesis count stable,
so the pinned pool stays within its warm allocation.
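
A condensed model of the control law using the gains and clamps quoted above (EMA smoothing factors and the rest of the structure are illustrative):

struct DispatchPacer {
    gpu_interval_ema_s: f64, // feed-forward: EMA of GPU inter-completion time
    ema_waiting: f64,        // smoothed gpu_work_queue.len()
    integral: f64,
    target: f64,             // desired items waiting at the GPU
}

impl DispatchPacer {
    fn next_interval_s(&mut self, waiting_now: f64) -> f64 {
        const KP: f64 = 0.1; // conservative gains for the 20-60s synthesis delay
        const KI: f64 = 0.008;
        self.ema_waiting = 0.8 * self.ema_waiting + 0.2 * waiting_now;
        let err = self.target - self.ema_waiting;
        self.integral += KI * err;
        // Positive error (queue under target) shortens the dispatch interval.
        let rate_mult = (1.0 + KP * err + self.integral).max(0.1);
        (self.gpu_interval_ema_s / rate_mult).clamp(0.05, 60.0)
    }
}

fn main() {
    let mut pacer = DispatchPacer {
        gpu_interval_ema_s: 3.7, // ~one partition's CUDA time, feed-forward base
        ema_waiting: 0.0,
        integral: 0.0,
        target: 3.0,
    };
    for waiting in [0.0, 1.0, 3.0, 5.0] {
        println!("waiting={waiting}: next dispatch in {:.2}s",
                 pacer.next_interval_s(waiting));
    }
}
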
…ynth cap, re-bootstrap

Major pacer redesign with four changes addressing three collapse modes:

1. GPU rate measurement: use actual GPU worker processing durations
   (atomically accumulated) instead of inter-completion intervals.
   The old approach included idle time when the queue drained,
   creating a self-reinforcing collapse. With N interleaved workers,
   effective interval = processing_time / N.

2. Remove synthesis throughput cap: the cap created a vicious cycle
   where slow dispatch → fewer concurrent synths → slower throughput
   → tighter cap → even slower dispatch. Memory budget provides the
   correct backpressure via budget.acquire() blocking.

3. Re-bootstrap on pipeline drain: when ema_waiting < 1.0, re-enter
   bootstrap to refill the pipe. Integral resets (stale from previous
   batch), GPU rate EMA preserved (hardware characteristic).

4. Slow bootstrap from 200ms to 3s initial / max(2s, gpu_eff) for
   re-bootstrap. Fast bootstrap flooded the pinned pool with
   concurrent cudaHostAlloc calls that serialized through the GPU
   driver, stalling all GPU activity.
PI tuning:
- Normalize error by target: (target - waiting) / target gives [-2, +1]
  range regardless of target value. Gains are now target-independent.
- kp 0.1 → 0.5: P does the heavy lifting on normalized error.
  Half-empty queue → 25% faster, empty → 50% faster, overfull → 50% slower.
- ki 0.008 → 0.02: gentle drift correction.
- Asymmetric integral clamp: +2.0 / -0.5 instead of ±20.0. Negative
  integral (slow down) is heavily restricted — aggressive backoff was
  draining the entire pipeline after memory ceiling slams.
- rate_mult clamp [0.1, 5.0] → [0.3, 3.0]: at most 3.3x slowdown.

Re-bootstrap fix:
- Old condition: ema_waiting < 1.0 (GPU queue low). This triggered
  re-bootstrap spam while items were still in the synthesis pipeline
  (30-60s to reach GPU queue). 42+ re-bootstraps in minutes.
- New condition: also require total_dispatched <= gpu_completions
  (pipeline truly empty — nothing in synthesis or GPU queue).
  Only re-bootstraps between actual proof batches.

Add in_flight metric to status log (dispatched - gpu_completions)
for pipeline depth visibility.
Hard cap on concurrent synthesis workers in the pipeline path.
Too many concurrent syntheses causes CPU contention and DDR5
memory bandwidth saturation, making each synthesis slower and
reducing overall throughput. 18 is a good default for 64-core
DDR5 systems.

Config: max_parallel_synthesis (default 18, 0 = use default).
Applied as: synth_worker_count = min(budget_partitions, max_parallel).
Contributor

@snadrus snadrus left a comment


The Go integration works, but it depends on the admin running CuZK ahead of time and on the same server (otherwise scheduling, which still runs, goes sideways).
Can we limit addresses to localhost?
