
10x cheaper C2/PRU: CuZK Proving engine #1043

Open
magik6k wants to merge 60 commits into main from feat/cuzk

Conversation

@magik6k
Collaborator

@magik6k magik6k commented Feb 20, 2026

Summary

Integrate the cuzk persistent GPU SNARK proving daemon with Curio's task scheduler via gRPC. When enabled, Curio delegates PoRep C2, SnapDeals prove, and PSProve SNARK computations to the cuzk daemon instead of spawning per-proof child processes through ffiselect.

  • Add gRPC client (lib/cuzk/) and SealCalls methods (lib/ffi/cuzk_funcs.go) for PoRep and SnapDeals
  • Wire cuzk into PoRep, SnapDeals, and PSProve tasks with backpressure via GetStatus
  • Add make cuzk build target (NOT in default BINS — CI unaffected)
  • Vendor bellpepper-core and supraseal-c2 crate files so git clone && make cuzk works
  • Add user-facing documentation under documentation/en/experimental-features/

What is cuzk

cuzk is a persistent Rust daemon that keeps Groth16 SRS parameters (~47 GiB for 32 GiB PoRep) resident in CUDA-pinned host memory across proofs. The current ffiselect model spawns a fresh process per proof, loading the SRS from scratch each time (30-90s). cuzk eliminates this overhead entirely.
Beyond SRS residency, cuzk implements a 13-phase optimization pipeline that achieves 2.8x throughput over the ffiselect baseline (37.7s/proof vs ~89s on RTX 5070 Ti). The key architectural contributions are pipelined partition synthesis, dual-worker GPU interlock, PCIe transfer optimization, and a split async GPU proving API.

Architecture

Curio (Go)                          cuzk daemon (Rust/CUDA)
─────────────                       ──────────────────────
tasks/seal,snap,proofshare          persistent process
        │                                   │
    gRPC client ────── unix/TCP ──────► gRPC server
  (lib/cuzk/client.go)                     │
                                    ┌──────┴───────┐
                                    │   Scheduler   │
                                    │  (priority Q) │
                                    └──────┬───────┘
                                           │
                                    ┌──────┴───────┐
                                    │  GPU Workers  │
                                    │  (per device) │
                                    └──────────────┘

Vanilla proofs are generated locally in Curio (requires sector data on disk), then sent to cuzk for SNARK computation. The returned proof is verified locally before submission.

Pipelining

The cuzk engine pipelines work at three levels:

1. Partition-Level Synthesis → GPU Pipeline (Phase 7)

Instead of synthesizing all 10 partitions of a sector as a batch before any GPU work begins (the ffiselect model), cuzk spawns partition_workers concurrent synthesis tasks. Each produces a single partition's ProvingAssignment (~13.6 GiB) and sends it through a bounded channel to the GPU worker:

Synthesis Workers:  Part 0    Part 1    Part 2   ...
                       │         │         │
                       ▼         ▼         ▼
GPU Channel:        ─── P0 ───── P1 ───── P2 ───►
                                               │
GPU Worker:         Prove P0 → Prove P1 → Prove P2 ...

The GPU processes partition N while workers synthesize partition N+1..N+k. With partition_workers=10, all 10 partitions synthesize concurrently and the GPU is continuously fed.
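
A minimal sketch of this stage, using std-only placeholders (synthesize_partition / gpu_prove_partition stand in for the real bellperson and supraseal calls, and scoped threads stand in for the engine's tokio workers); the bounded channel is what supplies the backpressure:

use std::sync::mpsc::sync_channel;
use std::thread;

fn synthesize_partition(idx: usize) -> Vec<u8> {
    vec![idx as u8] // placeholder for a ~13.6 GiB ProvingAssignment
}

fn gpu_prove_partition(assignment: &[u8]) -> Vec<u8> {
    assignment.to_vec() // placeholder for the CUDA proving call
}

fn main() {
    let partition_workers = 10;
    // Bounded channel: synthesis blocks on send() once the GPU falls behind.
    let (tx, rx) = sync_channel(partition_workers);

    thread::scope(|s| {
        for idx in 0..10 {
            let tx = tx.clone();
            s.spawn(move || {
                let assignment = synthesize_partition(idx);
                tx.send((idx, assignment)).unwrap(); // blocks when the channel is full
            });
        }
        drop(tx); // close the channel once only the workers hold senders
        // Single GPU worker drains partitions as they arrive.
        s.spawn(move || {
            for (idx, assignment) in rx {
                let _proof = gpu_prove_partition(&assignment);
                println!("proved partition {idx}");
            }
        });
    });
}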

2. Dual-Worker GPU Interlock (Phase 8)

Inside generate_groth16_proofs_c(), each partition has ~1.3s of CPU preprocessing (pointer setup, bitmap population) before ~3.3s of CUDA kernels, followed by ~0.7s of CPU epilogue. Phase 8 narrows the C++ mutex to cover only the CUDA kernel region and runs two GPU workers per device:

Worker A: CPU prep ══ CUDA ══ epilogue
Worker B:          CPU prep ────══ CUDA ══ epilogue
GPU:               ████ A ████████ B ████

This achieves 100% GPU utilization — zero idle gaps between partitions in steady state.

3. Split Async GPU API (Phase 12)

Phase 12 splits the monolithic C++ prove call into prove_start() (GPU kernels) + finalize() (b_g2_msm CPU + proof assembly). The GPU worker releases the lock ~1.7s earlier per partition, immediately picking up the next synthesized partition while b_g2_msm runs in a spawned finalizer task:

GPU Worker: ══ CUDA ══  ══ CUDA ══  ══ CUDA ══
Finalizer:        b_g2_msm    b_g2_msm    b_g2_msm
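
A condensed model of the split, with prove_start / finalize as illustrative stand-ins for the generate_groth16_proofs_start_c / finalize_groth16_proof_c FFI described in the commit log below (the real engine spawns tokio finalizer tasks rather than threads):

use std::thread;

struct PendingProof { partition: usize }

fn prove_start(partition: usize) -> PendingProof {
    // GPU kernels run here, under the per-device mutex.
    PendingProof { partition }
}

fn finalize(p: PendingProof) -> Vec<u8> {
    // b_g2_msm on CPU + proof assembly, outside the GPU lock.
    vec![p.partition as u8]
}

fn main() {
    let mut finalizers = Vec::new();
    for partition in 0..10 {
        let pending = prove_start(partition); // holds the GPU lock
        // The GPU worker is free ~1.7s earlier; finalization runs concurrently.
        finalizers.push(thread::spawn(move || finalize(pending)));
    }
    for f in finalizers {
        let _proof = f.join().unwrap();
    }
}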

Memory Management

SRS Residency

The daemon pre-populates GROTH_PARAM_MEMORY_CACHE at startup. Since the process is long-lived, the ~47 GiB SRS stays pinned in CUDA host memory across all proofs. No per-proof disk I/O.

Per-Partition Working Set

Memory is proportional to partition_workers, not total partitions:

Pipeline stage      Per-partition memory   Notes
During synthesis    ~16 GiB                12 GiB a/b/c + 4 GiB aux
After prove_start   ~4 GiB                 a/b/c freed immediately, only aux + density remain
Pending finalize    ~4 GiB                 Held by finalizer task

The formula: Peak RSS ≈ 69 + (partition_workers × 20) GiB (see the worked check after the list below)
Validated configurations:
  • 128 GiB system: pw=2, gw=1 → 110 GiB peak, 152s/proof
  • 256 GiB system: pw=7, gw=1 → 208 GiB peak, 53s/proof
  • 512 GiB system: pw=12, gw=2 → 400 GiB peak, 37.7s/proof
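
A worked check of the formula against the validated numbers (the 512 GiB row measures well above the formula's ~309 GiB estimate, so treat the formula as a sizing floor rather than an exact predictor; max_pw below is a hypothetical helper):

fn peak_rss_gib(partition_workers: u64) -> u64 {
    69 + partition_workers * 20
}

fn main() {
    assert_eq!(peak_rss_gib(2), 109); // measured: 110 GiB on the 128 GiB system
    assert_eq!(peak_rss_gib(7), 209); // measured: 208 GiB on the 256 GiB system
    // Largest pw fitting under a RAM ceiling, keeping a safety margin:
    let max_pw = |ram_gib: u64, margin_gib: u64| (ram_gib - margin_gib - 69) / 20;
    assert_eq!(max_pw(128, 10), 2);
}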

Backpressure

Three mechanisms prevent OOM at high concurrency:

  1. Early a/b/c free: prove_start() clears 12 GiB/partition immediately after GPU upload
  2. Channel capacity auto-scaling: bounded to max(synthesis_lookahead, partition_workers)
  3. Partition semaphore held through send: limits total in-flight synthesis outputs
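
A sketch of mechanisms 2 and 3 under tokio, with placeholder names: the permit is acquired before synthesis and dropped only after send() completes, so the semaphore bounds synthesizing plus queued outputs:

use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

async fn synthesize(idx: usize) -> Vec<u8> {
    vec![idx as u8] // placeholder for per-partition synthesis
}

#[tokio::main]
async fn main() {
    let partition_workers = 4;
    let sem = Arc::new(Semaphore::new(partition_workers));
    // Capacity mirrors mechanism 2: max(synthesis_lookahead, partition_workers).
    let (tx, mut rx) = mpsc::channel(partition_workers);

    for idx in 0..10 {
        let (sem, tx) = (sem.clone(), tx.clone());
        tokio::spawn(async move {
            // Mechanism 3: permit acquired before synthesis starts...
            let permit = sem.acquire_owned().await.unwrap();
            let out = synthesize(idx).await;
            tx.send((idx, out)).await.unwrap();
            drop(permit); // ...and released only after the output is queued
        });
    }
    drop(tx);
    while let Some((idx, _assignment)) = rx.recv().await {
        // GPU worker stand-in: consuming frees channel slots, unblocking senders.
        println!("gpu picks up partition {idx}");
    }
}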

CPU Locking / GPU Mutex

The C++ generate_groth16_proofs_c() originally used a static std::mutex that serialized the entire function. cuzk introduces:

  • Heap-allocated mutex (create_gpu_mutex() / destroy_gpu_mutex() FFI): one per physical GPU, managed by the engine. Passed through FFI as *mut c_void.
  • Narrowed scope: acquired before per-GPU CUDA kernel launch, released after kernels complete but before prep_msm_thread.join() — b_g2_msm and proof assembly run outside the lock.
  • Backward compatible: if gpu_mtx is null, falls back to the function-local static mutex (for non-engine callers).

The dual-worker interlock (2 workers per GPU) alternates lock acquisition so Worker B's CPU prep runs while Worker A holds the lock for CUDA kernels, and vice versa.
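
A pure-Rust model of the interlock (sleep durations scaled down 100x; a std Mutex stands in for the heap-allocated C++ mutex): with only the kernel region locked, one worker's prep and epilogue overlap the other's kernels:

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn cpu_prep()     { thread::sleep(Duration::from_millis(13)); } // ~1.3s scaled down
fn cuda_kernels() { thread::sleep(Duration::from_millis(33)); } // ~3.3s scaled down
fn epilogue()     { thread::sleep(Duration::from_millis(7));  } // ~0.7s scaled down

fn main() {
    let gpu = Arc::new(Mutex::new(())); // stands in for the per-device C++ mutex
    thread::scope(|s| {
        for worker in 0..2 {
            let gpu = Arc::clone(&gpu);
            s.spawn(move || {
                for partition in 0..5 {
                    cpu_prep(); // outside the lock: overlaps the other worker's kernels
                    {
                        let _gpu = gpu.lock().unwrap();
                        cuda_kernels(); // the only region serialized per device
                    }
                    epilogue(); // also outside the lock
                    println!("worker {worker} finished partition {partition}");
                }
            });
        }
    });
}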

Task Integration Details

When [Cuzk] Address is set in Curio config:

Behavior        Change
TypeDetails()   GPU requirement zeroed, RAM set to 1 GiB (resource decisions delegated to cuzk)
CanAccept()     Queries GetStatus → rejects if totalPending >= MaxPending
Do()            Generates vanilla proof locally → sends to cuzk via Prove RPC → verifies returned proof locally

When Address is empty (default), all tasks behave exactly as before. No behavioral change for existing deployments.

Build

make cuzk          # builds extern/cuzk → ./cuzk binary (~1m51s from scratch)
make install-cuzk  # installs to /usr/local/bin
make clean         # includes cargo clean in extern/cuzk
make cuzk is NOT in the default BINS or BUILD_DEPS targets, so CI (which has no CUDA) is unaffected. Requires nvcc and cargo.

Files Changed

New files:

  • lib/cuzk/client.go — gRPC client wrapper (connect, Prove, GetStatus, HasCapacity)
  • lib/cuzk/proving.pb.go, proving_grpc.pb.go — generated protobuf/gRPC stubs
  • lib/ffi/cuzk_funcs.go — PoRepSnarkCuzk, ProveUpdateCuzk on SealCalls
  • documentation/en/experimental-features/cuzk-proving-daemon.md — user guide

Modified files:

  • deps/config/types.go — CuzkConfig struct + defaults
  • cmd/curio/tasks/tasks.go — creates cuzk.Client, passes to task constructors
  • tasks/seal/task_porep.go — cuzkClient field, Do/CanAccept/TypeDetails branches
  • tasks/snap/task_prove.go — same pattern
  • tasks/proofshare/task_prove.go — same + threaded through computeProof→computePoRep/computeSnap
  • Makefile — cuzk build/install/clean targets
  • .gitignore — /cuzk binary

Vendored crate files (for git clone && make cuzk):

  • extern/bellpepper-core/ — 13 files (full crate: Cargo.toml, src/, licenses)
  • extern/supraseal-c2/ — 8 files (Cargo.toml, build.rs, Cargo.lock, tests)

Implement the cuzk proving engine as a Rust workspace in extern/cuzk/
with 5 crates (proto, core, server, daemon, bench) and full gRPC API.

Phase 0 delivers:
- gRPC daemon (TCP + Unix socket) with 8 RPC endpoints
- Real PoRep C2 proving via filecoin-proofs-api + SupraSeal CUDA backend
- SRS parameter residency via GROTH_PARAM_MEMORY_CACHE (lazy populate)
- Priority scheduler with binary heap queue
- Prometheus metrics endpoint
- Bench tool for single proof submission, status, preload, metrics

E2E validated: Two consecutive 32GiB PoRep C2 proofs on RTX 5070 Ti —
116.8s cold (SRS from disk) → 92.8s warm (SRS cached), 20.5% improvement.
Both produced valid 1920-byte Groth16 proofs.
…f fix

Improve the cuzk daemon's debuggability and operational readiness
for Phase 1 multi-GPU work:

Observability:
- Add tracing spans (info_span) with job_id correlation throughout
  prover and engine; upstream filecoin-proofs logs now tagged per-job
- Split timing into deserialize vs proving (monolithic in Phase 0)
- Per proof-kind Prometheus counters and duration summaries
- GPU detection via nvidia-smi in GetStatus RPC (name, VRAM)
- Running job info shown in status and annotated on GPU

Correctness:
- Fix AwaitProof to register late listeners (was broken, always 404)
- Graceful shutdown via watch channel (drain, finish current proof)
- Per-kind completed/failed counters with ring buffer for durations

Tooling:
- Add 'batch' command to cuzk-bench (sequential + concurrent modes,
  throughput stats with avg/min/max/proofs-per-min)
- Refactor bench client connection into shared connect() helper
- Add cuzk.example.toml with documented configuration

E2E validated: 32GiB PoRep C2 proof completes in ~110s with full
job_id-correlated logging and per-kind metrics.
…heduling

Wire up WinningPoSt, WindowPoSt, and SnapDeals provers via filecoin-proofs-api:
- prove_winning_post: generate_winning_post_with_vanilla
- prove_window_post: generate_single_window_post_with_vanilla (per-partition)
- prove_snap_deals: generate_empty_sector_update_proof_with_vanilla

Multi-GPU worker pool:
- Auto-detect GPUs via nvidia-smi or use config gpus.devices list
- Spawn one async worker loop per GPU with CUDA_VISIBLE_DEVICES isolation
- Per-worker SRS affinity tracking (last_circuit_id for future routing)

Proto/API updates:
- Add repeated bytes vanilla_proofs field for PoSt/SnapDeals multi-proof inputs
- Rename SnapDeals fields to comm_r_old/comm_r_new/comm_d_new (raw 32-byte)
- Registered proof type enum conversion (FFI V1_1 ↔ proofs-api V1_2 mapping)

Bench tool updated:
- Supports all proof types with --vanilla (JSON array of base64 proofs)
- New flags: --registered-proof, --randomness, --comm-r-old/new, --comm-d-new

8 unit tests pass, 0 warnings, clean cargo check --no-default-features.
…napDeals

Add gen-vanilla subcommand to cuzk-bench for generating vanilla proof test
data from existing sealed sector data. This completes Phase 1 by enabling
end-to-end testing of all four proof types (PoRep C2 plus WinningPoSt,
WindowPoSt, SnapDeals) without requiring Go/Curio.

Three sub-subcommands:
- winning-post: challenge selection + Merkle inclusion proofs (66 challenges)
- window-post: fallback challenges + vanilla proofs (10 challenges)
- snap-prove: partition proofs from original + updated sector data (16 partitions)

Key implementation details:
- filecoin-proofs-api added as optional dep behind 'gen-vanilla' feature flag
- CID commitment parsing via cid crate (bagboea4b5abc... → [u8;32])
- commdr.txt file format parsing (d:<CID> r:<CID>)
- Output format: JSON array of base64 strings (matches Go json.Marshal([][]byte))
- CPU-only, no GPU required (--no-default-features --features gen-vanilla)

Validated against /data/32gbench/ golden data:
- WinningPoSt: 164KB vanilla proof, 218KB JSON output
- WindowPoSt: 25KB vanilla proof, 33KB JSON output
- SnapDeals: 16 × 562KB partition proofs, 12MB JSON output

5 new unit tests (CID parsing, commdr format, JSON round-trip).
Fork bellperson 0.26.0 into extern/bellperson/ with minimal changes to
expose the synthesis/GPU split point for pipelined proving:

bellperson changes (3 files, ~130 lines changed):
- prover/mod.rs: Make ProvingAssignment struct and all fields pub
- prover/supraseal.rs: Make synthesize_circuits_batch() pub, add new
  prove_from_assignments() function (extracted GPU-phase code)
- groth16/mod.rs: Re-export ProvingAssignment, synthesize_circuits_batch,
  prove_from_assignments under cuda-supraseal feature

The internal two-phase architecture was already clean — synthesis runs
circuit.synthesize() on CPU (rayon parallel), producing ProvingAssignment
with a/b/c evaluation vectors + density trackers. GPU phase packs these
into raw pointer arrays and calls supraseal_c2::generate_groth16_proof().
We simply expose both phases as separate public functions.

cuzk workspace changes:
- Cargo.toml: Add [patch.crates-io] for bellperson fork, add bellperson
  as workspace dependency
- Cargo.lock: Updated to use local bellperson

Also includes cuzk-phase2-design.md with complete Phase 2 design:
- Per-partition pipeline strategy (13.6 GiB intermediate state instead of
  136 GiB for all 10 partitions)
- Memory budget analysis for 128 GiB vs 256 GiB machines
- SRS manager design using SuprasealParameters directly
- 7-step implementation plan
- Call chain comparison (Phase 1 monolithic vs Phase 2 pipelined)

All 8 existing cuzk tests pass. Zero new warnings from our changes.
Implement the core Phase 2 infrastructure: split monolithic seal_commit_phase2()
into separate CPU synthesis and GPU proving phases, connected via a pipeline.

New modules:
- srs_manager.rs: Direct SRS loading via SuprasealParameters (bypasses
  GROTH_PARAM_MEMORY_CACHE). CircuitId enum maps proof types to exact
  .params filenames. Supports preload, evict, memory budget tracking.

- pipeline.rs: Per-partition pipelined PoRep C2 proving. Each of the 10
  partitions is synthesized individually (~13.6 GiB intermediate state vs
  ~136 GiB for all 10 at once), then proven on GPU via bellperson's split
  API (synthesize_circuits_batch → prove_from_assignments).
  Enables PoRep pipelining on 128 GiB machines.

Engine changes:
- Engine now supports pipeline.enabled config flag
- When enabled, PoRep C2 jobs use pipelined prover with SrsManager
- When disabled, falls back to Phase 1 monolithic prover
- SRS preloading uses SrsManager in pipeline mode

Config additions:
- [pipeline] section: enabled, synthesis_lookahead
- synthesis_lookahead controls backpressure (partitions buffered)

Dependencies:
- Added direct deps on filecoin-proofs, storage-proofs-{core,porep,post,update},
  bellperson (fork), blstrs, ff, rayon, rand_core, filecoin-hashers
- Correct feature flag propagation (cuda-supraseal for core+bellperson,
  cuda for porep/post/update which lack cuda-supraseal)

Tests: 15 pass (12 existing + 3 new), 0 warnings from cuzk code.
Compiles with --no-default-features (no GPU required for check builds).
Rewrite pipeline.rs to use batch synthesis (all 10 PoRep partitions in
one rayon-parallel call + single GPU pass) instead of per-partition
sequential mode. This matches monolithic performance (~91s vs ~93s)
while enabling cross-proof overlap in the next step.

Add pipelined synthesis/prove functions for all 4 proof types:
- PoRep C2: batch mode (synthesize_porep_c2_batch + gpu_prove)
- WinningPoSt: inlined circuit construction (no private API needed)
- WindowPoSt: single-partition inlined circuit construction
- SnapDeals: all-partition circuit construction

Other changes:
- engine.rs: route all proof types through pipeline when enabled
- prover.rs: make 4 helper functions pub for pipeline.rs use
- Add bincode dep for PoSt/SnapDeals vanilla proof deserialization
Restructure the engine to use a two-stage pipeline architecture when
pipeline mode is enabled:

  Stage 1 (synthesis task): Pulls requests from the scheduler, runs
  CPU-bound circuit synthesis on a blocking thread, pushes the
  SynthesizedJob (intermediate state + SRS ref) to a bounded channel.

  Stage 2 (GPU workers): One per GPU, pull SynthesizedJob from the
  shared channel, run gpu_prove on a blocking thread pinned to their
  GPU via CUDA_VISIBLE_DEVICES, complete the job.

The bounded channel (capacity = synthesis_lookahead config, default 1)
provides backpressure: when GPU workers are busy and the channel is
full, the synthesis task blocks — preventing OOM from unbounded
pre-synthesized proofs.

For PoRep 32G under continuous load, this enables:
  synth(N) | GPU(N) + synth(N+1) | GPU(N+1) + synth(N+2) | ...
  Steady-state: ~55s/proof (synthesis-bound) vs ~91s sequential

When pipeline.enabled = false, falls back to Phase 1 monolithic
workers (no overlap, full cycle per GPU worker).

Also updates the example config with improved pipeline documentation.
Add batch collector and multi-sector synthesis to the pipeline engine.
When max_batch_size > 1, same-type PoRep requests are accumulated and
processed as a single combined synthesis + GPU proving pass, amortizing
fixed GPU costs and improving SM utilization.

New files:
- batch_collector.rs: Accumulates same-circuit-type proof requests,
  flushes on max_batch_size or max_batch_wait_ms timeout. PoRep and
  SnapDeals are batchable; PoSt types bypass the collector entirely.

Pipeline changes:
- synthesize_porep_c2_multi(): Takes N sectors' C1 outputs, builds all
  N×10 partition circuits, synthesizes in one batch call. Returns
  combined SynthesizedProof + sector_boundaries for splitting results.
- split_batched_proofs(): Splits concatenated GPU output back into
  per-sector proof byte vectors using sector_boundaries.

Engine changes:
- Synthesis task now uses BatchCollector for batchable proof types.
  Races scheduler delivery against batch timeout. Non-batchable types
  (WinningPost, WindowPost) preempt-flush any pending batch and process
  immediately.
- SynthesizedJob extended with batch_requests and sector_boundaries.
- GPU worker handles batched results: splits proof output, notifies
  each sector's individual caller with its own proof bytes and timings.

Config:
- scheduler.max_batch_size controls batch limit (1=disabled, 2-3 typical)
- scheduler.max_batch_wait_ms controls accumulation window

Backward compatible: max_batch_size=1 (default) preserves Phase 2
single-sector behavior exactly. All 25 tests pass, 0 cuzk warnings.
…oughput

All Phase 3 E2E tests pass on RTX 5070 Ti:
- Timeout flush: BatchCollector correctly flushes after 30s wait
- Batch=2: 2 sectors synthesized as 20 circuits in 55s (same as 10),
  GPU 69s, yielding 62.7s/proof (1.42x vs baseline 89s)
- Overflow: 3 proofs with batch=2 shows correct batch+overflow+pipeline
- Non-batchable: WinningPoSt bypasses BatchCollector (0.8s total)

Memory: batch=2 peaks at 360 GiB (vs 203 GiB for single proof).
Updated roadmap table with measured numbers.
Synthesis optimizations (55.4s → 50.9s, -8.3%):
- Boolean::add_to_lc/sub_from_lc: eliminate temporary LC allocations in
  circuit gadget hot paths (Boolean::lc creates a fresh Vec on every call;
  the new methods append directly to an existing LC)
- Patched: UInt32::addmany, Num::add_bool_with_coeff, Boolean::enforce_equal,
  Boolean::sha256_ch, Boolean::sha256_maj, lookup3_xy,
  lookup3_xy_with_conditional_negation
- Vec recycling pool in ProvingAssignment::enforce for the 6 LC buffers
- Software prefetch in eval_with_trackers and LinearCombination::eval
- perf stat: 91B fewer instructions (-15.3%), 18.6B fewer branches (-26.7%)

GPU async deallocation (36s → 26s bellperson wrapper, -10s):
- Root cause: ~37 GB of C++ vectors (split_vectors, tail_msm_bases) and
  ~130 GB of Rust Vecs (ProvingAssignment a/b/c) freed synchronously in
  destructors after GPU proving, blocking return for ~10s of munmap() calls
- C++ fix: move split_vectors + tail_msm bases into detached std::thread
- Rust fix: spawn thread to drop provers/input_assignments/aux_assignments
- CUDA internal timing unchanged (~26s); overhead was pure deallocation

Also: A4 (parallel B_G2 CPU MSM), D4 (per-MSM window objects),
CUDA timing instrumentation, synth-only microbenchmark tool.

E2E 32 GiB PoRep C2 on RTX 5070 Ti: 88.9s → 77.2s (-13.2%)
Pre-allocate ProvingAssignment Vecs (a, b, c, aux_assignment) to their
final capacity using hints cached from the first synthesis. Eliminates
~27 reallocation cycles per Vec per circuit.

Benchmarked: no measurable impact on 32 GiB PoRep C2 (50.65s with and
without hints). Rust's geometric doubling amortizes well at our scale,
and the ~265 GB of theoretical redundant copies are overlapped with
computation across 10 parallel circuits on 96 cores. Kept as defensive
code for memory-constrained environments.
Replace full circuit synthesis (alloc+enforce) with two-phase approach:
1. WitnessCS: witness-only generation (enforce is no-op)
2. CSR MatVec: pre-compiled sparse matrix × witness vector

New cuzk-pce crate with:
- RecordingCS: captures R1CS structure into CSR format (with tagged
  column encoding to handle interleaved alloc_input/enforce)
- CsrMatrix/PreCompiledCircuit: serializable CSR storage
- spmv_parallel: row-parallel sparse MatVec with rayon
- evaluate_pce: builds witness vector, evaluates A*w, B*w, C*w
- PreComputedDensity: density bitmaps extracted from CSR structure

Pipeline integration:
- synthesize_auto() dispatcher: PCE fast path when cached, old path otherwise
- Static OnceLock caches per circuit type (porep-32g, winning-post, etc.)
- ProvingAssignment::from_pce() constructor in bellperson fork
- All 6 synthesis call sites switched to synthesize_auto()

Benchmark (pce-bench subcommand):
- Correctness: all 10 circuits × 130M constraints match bit-for-bit
- Baseline synthesis: 50.4s (10 circuits, old path)
- PCE synthesis:     35.5s (26.5s witness + 8.8s MatVec)
- Speedup:           1.42x
- PCE extraction:    46.9s (one-time cost, amortized over all future proofs)
- Peak RAM:          375 GB
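
For illustration, a minimal row-parallel CSR SpMV in the spirit of spmv_parallel, assuming rayon and with u64 wrapping arithmetic standing in for blstrs field elements:

use rayon::prelude::*;

struct CsrMatrix {
    row_ptr: Vec<usize>, // len = rows + 1
    col_idx: Vec<usize>,
    values: Vec<u64>,
}

fn spmv_parallel(m: &CsrMatrix, witness: &[u64]) -> Vec<u64> {
    (0..m.row_ptr.len() - 1)
        .into_par_iter() // one rayon task per row
        .map(|row| {
            let (start, end) = (m.row_ptr[row], m.row_ptr[row + 1]);
            (start..end)
                .map(|k| m.values[k].wrapping_mul(witness[m.col_idx[k]]))
                .fold(0u64, u64::wrapping_add)
        })
        .collect()
}

fn main() {
    // 2x3 matrix [[1,0,2],[0,3,0]] times witness [1,1,1] = [3,3]
    let m = CsrMatrix {
        row_ptr: vec![0, 2, 3],
        col_idx: vec![0, 2, 1],
        values: vec![1, 2, 3],
    };
    assert_eq!(spmv_parallel(&m, &[1, 1, 1]), vec![3, 3]);
}
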
Add PcePipeline subcommand to cuzk-bench for testing PCE memory behavior
under sequential and parallel pipelining modes:
- RSS tracking via /proc/self/status at each pipeline stage
- malloc_trim() between proofs for clean memory release
- Wave-based parallel execution using std::thread::scope (-j N flag)
- compare_old flag for A/B comparison in first iteration

Update cuzk-project.md with j=2 parallel pipeline benchmark results:
- 2 concurrent syntheses: 49s wall vs 71s sequential (1.45x wall speedup)
- Per-proof degradation: 46-49s (vs 35.5s j=1) due to BW contention
- Peak RSS: 407 GiB (2x working sets + PCE static + transient)
PCE disk persistence (raw binary format):
- New cuzk-pce::disk module with save_to_disk/load_from_disk
- Raw binary format (v2): 32-byte header + bulk byte dumps of CSR vectors
- 5.4x faster than bincode: 9.2s load vs 49.9s (from tmpfs, 25.7 GiB)
- Atomic writes (tmp + rename) to prevent corruption
- Header with magic/version/dimensions for quick validation
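
A sketch of the save path under those constraints; the header layout and names here are illustrative, not the actual v2 format:

use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

const MAGIC: u32 = 0x4350_4345; // illustrative, not the real magic value
const VERSION: u32 = 2;

fn save_csr(path: &Path, rows: u64, cols: u64, values: &[u8]) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    {
        let mut f = File::create(&tmp)?;
        // 32-byte header: magic, version, dimensions, payload length.
        f.write_all(&MAGIC.to_le_bytes())?;
        f.write_all(&VERSION.to_le_bytes())?;
        f.write_all(&rows.to_le_bytes())?;
        f.write_all(&cols.to_le_bytes())?;
        f.write_all(&(values.len() as u64).to_le_bytes())?;
        // Bulk byte dump, no per-element serialization (hence the win over bincode).
        f.write_all(values)?;
        f.sync_all()?;
    }
    fs::rename(&tmp, path) // atomic on the same filesystem: no torn files
}

fn main() -> io::Result<()> {
    save_csr(Path::new("/tmp/porep-32g.pce"), 4, 4, &[0u8; 64])
}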

Daemon integration:
- preload_pce_from_disk() called at engine startup (loads all PCE files)
- extract_and_cache_pce() now saves to disk after extraction
- Background PCE auto-extraction triggered after first old-path synthesis
- get_pce() made public for engine-level cache checking

Phase 6 design document (c2-optimization-proposal-6.md):
- Slotted partition pipeline: overlap synth/GPU at partition granularity
- slot_size=2 sweet spot: 41s latency (vs 69.5s batch), 54 GiB RAM (vs 136 GiB)
- Steady-state throughput unchanged (35.5s/proof, synthesis-bound)
- Multi-sector and multi-GPU extension paths documented

Measured (RTX 5070 Ti, 32 GiB PoRep):
- PCE save (NVMe): 22.3s, 1.2 GB/s
- PCE load (tmpfs): 9.2s, 3.0 GB/s
- PCE load (NVMe): ~13-15s estimated (3x faster than 47s extraction)
…esis

Redesign the slotted pipeline to truly pipeline partition synthesis with
GPU proving. All 10 partitions are synthesized in parallel (bounded by
channel capacity), and the GPU consumes them one at a time as they
arrive.

Key changes:
- prove_porep_c2_partitioned(): spawns one thread per partition via
  std::thread::scope, all run concurrently. Bounded sync_channel
  provides backpressure to limit live RAM.
- Each partition = 1 GPU call (num_circuits=1), which gives fast
  b_g2_msm (~0.4s multi-threaded vs ~23s for num_circuits>=2).
- ProofAssembler: indexed by partition number, supports out-of-order
  arrival, assembles in partition order.
- synthesize_partition(): single-partition synthesis helper.
- Backward-compatible prove_porep_c2_slotted() wrapper dispatches
  to partitioned path when slot_size < num_partitions.

Benchmark results (32 GiB PoRep, 96-core Zen4, RTX 5070 Ti):
  max_concurrent=1: 72.0s, 71.3 GiB peak (5.42x overlap)
  max_concurrent=2: 72.7s, 86.8 GiB peak (5.38x overlap)
  max_concurrent=3: 71.9s, 86.8 GiB peak (5.37x overlap)
  batch-all:        62.3s, 228.5 GiB peak (no overlap)

Pipelined mode uses 3.2x less RAM (71 vs 228 GiB) with only ~16%
latency overhead. GPU takes ~3.8s/partition vs 25.5s batch-all total.
…ispatcher

Add timeline instrumentation for waterfall visualization of the proving
pipeline. Events (SYNTH_START/END, CHAN_SEND, GPU_PICKUP/START/END) are
emitted as CSV to stderr with millisecond offsets from engine start,
enabling precise analysis of GPU utilization and idle gaps.

Add synthesis_concurrency config parameter that controls how many proofs
can be synthesized simultaneously on the CPU. When synthesis takes longer
than GPU proving (39s vs 27s), the GPU idles ~12s between proofs with
sequential synthesis. With concurrency=2, overlapping syntheses can keep
the GPU continuously fed.

Implementation uses tokio::sync::Semaphore to limit concurrent synthesis
tasks. When concurrency=1 (default), behavior is identical to the old
sequential loop. When >1, each batch is spawned as an independent task
with semaphore-guarded concurrency.

Benchmark results (PoRep C2, 5-proof runs):
  concurrency=1: 45.3s/proof, 70.9% GPU utilization (baseline)
  concurrency=2, j=2: 42.2s/proof, 77.8% GPU utilization (+7%)
  concurrency=2, j=3: 43.1s/proof, 90.7% GPU utilization (+5%)
  concurrency=2, j=4: 60.2s/proof (CPU contention, regression)

CPU contention between synthesis (rayon) and b_g2_msm (rayon) during GPU
proving limits the improvement. Thread pool isolation is the next step.
Add configurable thread pool partitioning to reduce CPU contention when
running parallel synthesis alongside GPU proving.

Two independent thread pools compete for CPU cores during proving:
  1. Rayon global pool — used by synthesis (bellperson, PCE SpMV)
  2. C++ groth16_pool (sppark) — used by b_g2_msm and preprocessing

Changes:
- groth16_cuda.cu: Convert static groth16_pool to lazy initialization
  via std::call_once, reading CUZK_GPU_THREADS env var for pool size.
  This allows the Rust caller to set the env var before first GPU call.
- groth16_srs.cuh: Update all pool references to use get_groth16_pool()
- config.rs: Add gpus.gpu_threads field (default 0 = all CPUs)
- daemon main.rs: Configure rayon global pool from synthesis.threads,
  set CUZK_GPU_THREADS from gpus.gpu_threads before engine start
- Cargo.toml: Add rayon dependency to cuzk-daemon
- cuzk.example.toml: Document thread isolation strategy

Benchmark results (PoRep C2 32G, 96C/192T + RTX 5070 Ti):
  Baseline (sequential, no isolation):      46.1s/proof, 70.9% GPU util
  Parallel c=2, j=2, no isolation:          46.0s/proof, 81.9% GPU util
  Parallel c=2, j=2, rayon=192, gpu=32:     44.9s/proof, 76.9% GPU util
  Parallel c=2, j=3, rayon=192, gpu=32:     42.8s/proof (best, +7.2%)

Thread isolation provides modest improvement (~2-3%). The dominant factor
remains synthesis thread scalability: 2 syntheses sharing the rayon pool
each get ~96 effective threads, inflating synth from 39s to 45-47s.
Higher pipeline fill (j=3) is more effective than thread partitioning.
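
A Rust analog of the lazy env-sized pool (the actual change is C++ std::call_once in groth16_cuda.cu; here a rayon pool plays the role of groth16_pool):

use std::sync::OnceLock;

fn gpu_pool() -> &'static rayon::ThreadPool {
    static POOL: OnceLock<rayon::ThreadPool> = OnceLock::new();
    POOL.get_or_init(|| {
        // Read once, lazily: the daemon sets the env var before the first GPU call.
        let threads = std::env::var("CUZK_GPU_THREADS")
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(0); // 0 = all CPUs (rayon default)
        rayon::ThreadPoolBuilder::new()
            .num_threads(threads)
            .build()
            .expect("gpu pool")
    })
}

fn main() {
    std::env::set_var("CUZK_GPU_THREADS", "32");
    println!("gpu pool threads: {}", gpu_pool().install(rayon::current_num_threads));
}
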
Proposal 7 replaces the thundering-herd synthesis pattern (all 10
partitions start/finish simultaneously) with a synth worker pool that
processes partitions individually and feeds them to the GPU one at a time.

Key design points:
- 20 synth workers (configurable) each synthesize 1 partition (~29s)
- Workers submit to engine GPU channel; block if full (backpressure)
- GPU proves each partition with num_circuits=1 (b_g2_msm: 0.4s vs 25s)
- ProofAssembler in JobTracker accumulates partitions per job_id
- Cross-sector overlap: next sector's synth starts on free workers

Expected impact: 42.8s/proof → ~30s/proof steady-state (GPU-limited),
~100% GPU utilization, zero inter-sector GPU idle time.

~110 net new lines of code, primarily in engine.rs.
Implement the Phase 7 architecture from c2-optimization-proposal-7.md:
dispatches individual PoRep partitions as independent work units through
the engine's synthesis→GPU pipeline, eliminating the thundering-herd
pattern and enabling cross-sector pipelining.

Key changes:
- SynthesizedJob: add partition_index, total_partitions, parent_job_id
  fields for per-partition routing
- PartitionedJobState: new struct tracking per-job ProofAssembler,
  accumulated timings, and failure state
- PartitionWorkItem: work unit for spawn_blocking synthesis workers
- JobTracker: add assemblers map for in-progress partitioned proofs
- process_batch(): new Phase 7 dispatch path when partition_workers > 0
  and single-sector PoRep C2 — parses C1 once, registers assembler,
  dispatches 10 spawn_blocking tasks gated by partition_semaphore,
  returns immediately (non-blocking)
- GPU worker: partition-aware result routing — routes partition proofs
  to ProofAssembler, delivers final proof when all partitions complete,
  calls malloc_trim(0) after each partition to release memory
- Error handling: failed flag on PartitionedJobState, synthesis/GPU
  failure propagation, skip work for already-failed jobs
- Config: add synthesis.partition_workers (default 20), partition
  semaphore limiting concurrent synthesis workers
- Phase 6 slotted pipeline retained as fallback (partition_workers=0,
  slot_size>0)
- ParsedC1Output and parse_c1_output made pub for engine access
- synthesize_partition made pub for engine dispatch

Expected steady-state: 42.8s/proof → ~30s/proof (GPU-limited), ~100%
GPU utilization, zero cross-sector GPU idle gaps. Per-partition GPU
calls use num_circuits=1, making b_g2_msm 0.4s instead of 25s.
Proposal to eliminate per-partition GPU idle gaps by overlapping one
worker's CPU preamble/epilogue with another worker's CUDA kernel
execution. Two GPU workers per physical GPU share a fine-grained
mutex that brackets only the CUDA kernel region inside
generate_groth16_proofs_c.

Key findings:
- The static mutex in groth16_cuda.cu covers the entire function
  (~3.5s), but actual CUDA kernel time is ~2.1s. The remaining
  ~1.3s is CPU work (preprocessing, b_g2_msm, epilogue) that
  could overlap with the next partition's GPU execution.
- The sppark semaphore_t is a counting semaphore that latches
  notify() before wait(), confirming safe barrier semantics for
  the proposed restructuring.
- Recommended approach: pass mutex pointer from Rust through FFI,
  acquire before per-GPU thread launch, release after per-GPU
  thread join, leaving b_g2_msm and epilogue outside the lock.

Estimated impact: GPU efficiency ~64% → ~98%, throughput ~3-10%
improvement on top of Phase 7.
Narrow the C++ static mutex in generate_groth16_proofs_c to cover only
the CUDA kernel region (NTT+MSM, batch additions, tail MSMs). CPU
preprocessing and b_g2_msm now run outside the lock, allowing two GPU
workers to interleave: one does CPU work while the other runs CUDA.

Changes across 7 files (~195 lines):

- groth16_cuda.cu: Remove static mutex, add std::mutex* parameter,
  acquire lock before per-GPU thread launch, release after per-GPU
  join (before prep_msm_thread join). Add create/destroy_gpu_mutex
  C helpers for FFI allocation.

- supraseal-c2/lib.rs: Add gpu_mtx parameter to FFI decl and both
  generate_groth16_proof wrappers. Export alloc/free_gpu_mutex.

- bellperson supraseal.rs: Add GpuMutexPtr type, SendableGpuMutex
  wrapper, alloc/free helpers. Thread gpu_mutex through
  prove_from_assignments. Legacy callers pass null (fallback mutex).

- pipeline.rs: Thread GpuMutexPtr through gpu_prove(). Internal
  callers pass null_mut() for backward compatibility.

- engine.rs: Create one C++ mutex per GPU via alloc_gpu_mutex().
  Spawn gpu_workers_per_device workers per GPU (default 2), each
  sharing the same mutex address (as usize for Send safety).

- config.rs: Add gpus.gpu_workers_per_device (default 2).

Benchmark results (RTX 5070 Ti, 96-core Zen4, partition_workers=20):

  Single proof:  69.3s wall (GPU efficiency: 100.0% — zero idle gaps)
  Throughput c=5 j=3: 44.0s/proof (Phase 7: 50.7s → 13.2% improvement)
  Throughput c=5 j=2: 49.5s/proof (Phase 7: 59.8s → 17.2% improvement)

  partition_workers=30 regresses to 60.4s/proof due to CPU contention
  from 30 simultaneous synthesis workers starving GPU preprocessing.
Document three new phases of the pipelined SNARK proving engine:

- Phase 6: Pipelined partition proving (slot-based, 62x b_g2_msm speedup)
- Phase 7: Engine-level per-partition pipeline (cross-sector overlap)
- Phase 8: Dual-worker GPU interlock (100% GPU utilization)

Key benchmark findings:
- Optimal partition_workers=10-12 on 96-core machine (43.5s/proof → 37.4s)
- System is perfectly GPU-bound: throughput = serial CUDA kernel time
  (10 partitions × 3.75s = 37.5s vs measured 37.4s/proof)
- Cross-sector GPU transitions are seamless (<50ms after warmup)
- synthesis_concurrency>1 provides no benefit (synthesis already overlapped)

Update file references and related documents for Phases 6-8.
Two changes to reduce GPU SM idle time caused by PCIe transfers
inside the GPU mutex:

1. Pre-stage a/b/c polynomials (6 GiB) outside the mutex via
   cudaHostRegister + async upload on a dedicated copy stream.
   Overlaps with the other worker's CUDA kernels.

2. Deferred batch sync in Pippenger MSM: double-buffer host-side
   bucket results so GPU never waits for CPU to process the
   previous batch. Eliminates 8+ per-batch idle gaps per MSM.

Includes full PCIe transfer inventory (23.6 GiB HtoD per partition)
and expected 4-9% throughput improvement over Phase 8.
…uploads

- Pre-stage a/b/c polynomial uploads using cudaHostRegister + async DMA
  before GPU mutex acquisition (host pinning) and after (device alloc + upload)
- Memory-aware allocation: query cudaMemGetInfo after pool trim, only pre-stage
  if full 12 GiB (d_a + d_bc) fits with 512 MiB safety margin
- Double-buffered deferred batch sync in Pippenger MSM (sppark submodule):
  per-batch sync deferred to next iteration, overlapping DtoH with compute
- Early d_bc free inside per_gpu thread after NTT phase completes
- GPU resources cleaned up before mutex release, host pages unregistered after

Results (gw=1, pw=10, c=3, j=1):
- 32.1s/proof avg (14.2% improvement over Phase 8 baseline 37.4s)
- ntt_msm_h_ms: 2430ms -> 690ms (-71.6%)
- gpu_total_ms: 3746ms -> 1450ms (-61.3%)

gw=2 shows regression (41.0s) due to cudaDeviceSynchronize + pool trim
serialization — needs further investigation.
Add per-stage timing to prestage setup: sync_ms, trim_ms, alloc_ms, upload_ms.

Key findings with c=15 j=15 gw=1:
- Pre-staging overhead: 18ms avg (negligible - PCIe gen5 is fast)
- GPU kernels: 1824ms avg/partition
- CPU critical path (prep_msm + b_g2_msm): 2393ms avg/partition
- CPU is the bottleneck, not GPU — DDR5 bandwidth wall
  with 10 concurrent synthesis workers competing for memory
- Throughput: 41.3s/proof (steady-state)
- c=30 j=20 causes OOM/crash from memory pressure
Phase 9 cuts GPU kernel time 51% (3.7s→1.8s/partition) but steady-state
throughput only improves 14% (37.4→32.1s in isolation) because CPU
preprocessing (prep_msm + b_g2_msm = 2.4s/partition) is now the critical
path. At high concurrency, 10 synthesis workers saturate 8-channel DDR5
bandwidth, slowing CPU MSM operations 12-27% and limiting throughput to
~41s/proof.
Phase 10 (two-lock GPU interlock) was implemented, tested, and abandoned:
- 16 GB VRAM too small for 2 workers' pre-staged buffers
- CUDA memory APIs are device-global, serializing across streams
- Phase 9 already hides b_g2_msm behind GPU lock release

Phase 11 design spec identifies 3 sources of throughput degradation
(32.1s isolation → 38.0s at c=20 j=15) and proposes 3 interventions:
1. Serialize async_dealloc to bound TLB shootdown storms
2. Reduce groth16_pool to 32 threads to cut L3 thrashing
3. Memory-bandwidth throttle during b_g2_msm via shared atomic

Also reverts groth16_cuda.cu Phase 10 timing instrumentation back to
Phase 9 state.
Three interventions to reduce CPU memory subsystem contention at high
concurrency (c=20 j=15):

1. Serialize async_dealloc threads (static mutex in C++ and Rust) to
   prevent concurrent munmap() TLB shootdown storms. Alone: negligible.

2. Reduce groth16_pool from 192 to 32 threads (gpu_threads=32 config).
   Cuts b_g2_msm L3 cache footprint from ~1.1 GiB to ~192 MB. b_g2_msm
   slows from 0.5s to 1.7s but runs outside GPU lock. Best result:
   36.7s/proof (3.4% improvement over Phase 9 baseline of 38.0s).

3. Memory-bandwidth throttle: global AtomicI32 flag set by C++ around
   b_g2_msm, checked by Rust SpMV every 64 chunks with yield_now().
   No additional gain over Intervention 2 alone.

Also tested gw=3 (37.2s) and gw=4 (37.4s) — both worse due to CPU
contention from additional GPU workers.

Optimal config: gw=2, pw=10, gpu_threads=32 → 36.7s/proof.
Decouple b_g2_msm CPU computation from the GPU worker loop so the GPU
worker can pick up the next synthesized partition ~1.7s faster. The C++
generate_groth16_proofs_c is refactored into start (returns pending
handle after GPU lock release) + finalize (joins b_g2_msm, runs
epilogue). GPU workers spawn a separate tokio finalizer task and
immediately loop back for the next job.

Key changes:
- C++ groth16_pending_proof struct holds all shared state on the heap
- generate_groth16_proofs_start_c / finalize_groth16_proof_c split API
- Fix use-after-free: prep_msm_thread now reads provers_owned (heap
  copy) instead of the stack parameter that goes out of scope
- Rust FFI: start_groth16_proof, finish_groth16_proof, drop_pending_proof
- Bellperson: PendingProofHandle<E>, prove_start(), finish_pending_proof()
- Pipeline: gpu_prove_start() / gpu_prove_finish(), PendingGpuProof alias
- Engine: GPU worker restructured with spawned finalizer task; extracted
  process_partition_result() and process_monolithic_result() helpers
- SynthesisCapacityHint struct added (was referenced but undefined)
- Removed unused PR generic from start_groth16_proof FFI

Benchmark (gw=2 pw=10 gt=32, c=20 j=15): 37.1s/proof throughput
(vs 38.0s Phase 11 baseline, ~2.4% improvement).
@magik6k
Collaborator Author

magik6k commented Feb 27, 2026

Do you plan to do more benchmarks on different hardware?

Definitely, but I already benchmarked under some pipeline constraints, which should be fairly representative.

Generally, CuZK is shipped here as an entirely separate daemon which you connect to a Curio node as a 100% optional add-on, so this integration is as safe as it gets.

I'll run this on more diverse HW once we get this PR in, but I expect it to be universally faster and cheaper:

  • This uses less average RAM per proof execution
  • It lowers minimum RAM required below even pre-supraseal-c2 requirements
  • It optimizes some operations to the point that they are guaranteed to always be faster
  • Heavier pipelining guarantees that bigger hardware is much better utilized
  • Synth pipelines have memory-bandwidth-optimizing semaphores that ensure really memory-bandwidth-expensive compute doesn't overlap

I'd be very surprised if there is any machine configuration that is worse with this.

Eventually we might figure out a good way to pull this daemon into the main Curio binary, but initially it seems wise to gain confidence in a maximally sandboxed way.

FWIW, I just verified valid PoSt and PoRep proofs on Calibnet, so that's already something.

@magik6k magik6k marked this pull request as ready for review February 27, 2026 23:08
@magik6k magik6k requested a review from a team as a code owner February 27, 2026 23:08
@filecoin-project filecoin-project deleted a comment from cursor bot Feb 27, 2026
magik6k added 5 commits March 1, 2026 14:27
Thread a gpu_index parameter through the entire proving stack
(C++ -> supraseal-c2 -> bellperson -> pipeline -> engine) so that
single-circuit partition proofs run on the GPU assigned to the Rust
worker instead of always landing on GPU 0.

Previously, the C++ code computed n_gpus = min(ngpus(), num_circuits),
which for single-circuit proofs always resolved to GPU 0 via
select_gpu(0). This made per-GPU mutexes ineffective on multi-GPU
systems: workers assigned to different GPUs could run CUDA kernels
simultaneously on GPU 0, causing proof corruption (the original
shared-mutex workaround serialized everything to one GPU).

Now gpu_index >= 0 pins work to that specific GPU, while -1 preserves
the original multi-GPU fan-out for batched proofs. Also converts the
global d_a_cache singleton to a per-GPU array to avoid thrashing when
workers on different GPUs run concurrently.
Collaborator Author


Should clean those up; potentially we do want those around somewhere in code docs as knowledge bases / additional context, mostly for Agents to use?

@snadrus
Contributor

snadrus commented Mar 7, 2026 via email

magik6k added 15 commits March 13, 2026 11:45
…eanup

Fix three proofshare provider bugs:

1. CreateWorkAsk deadlock: When the service returns HTTP 429 (TooManyRequests),
   CreateWorkAsk retried forever, blocking the Do() loop from polling for work
   matched to existing asks. This created a permanent deadlock where no work
   could be inserted into proofshare_queue. Now returns ErrTooManyRequests
   immediately; the caller applies exponential backoff (up to 2min) on
   no-progress iterations and resets to 3s when progress is made.

2. cuzk job_id collision: PSProve PoRep RequestId was fmt.Sprintf("ps-porep-%d-%d",
   miner, sector) which is identical for all concurrent proofshare challenges
   (all target miner=1000, sector=1). This caused the cuzk engine's partition
   assembler to mix results from different proofs, producing 0/10 valid
   partitions. Now includes taskID for uniqueness.

3. Queue maintenance: proofshare_queue rows with submit_done=TRUE were never
   deleted, causing unbounded table growth and expensive dedup SELECTs.
   Completed rows older than 2 days are now purged every 5 minutes.
   Orphaned compute tasks are reset (UPDATE SET NULL) instead of deleted
   so work can be re-assigned. Dedup SELECT scoped to submit_done=FALSE.
Replace the static partition_workers semaphore with a unified MemoryBudget
that tracks all major memory consumers (SRS pinned, PCE heap, synthesis
working set) under a single byte-level budget auto-detected from system RAM.

- Add MemoryBudget and MemoryReservation with RAII partial-release support
- Add PceCache (replaces static OnceLock PCE storage) with LRU eviction
- Make SrsManager budget-aware with on-demand loading and eviction
- Two-phase working memory release: a/b/c freed after prove_start,
  rest after prove_finish
- Remove partition_workers, srs.preload, pinned_budget config fields
- Add total_budget (auto/explicit), safety_margin, eviction_min_idle config
- Backward-compatible config parsing (old fields ignored with warnings)
- All 15 unit tests pass, pce-bench validation passes on real 32G data
The evictor callback runs from async budget.acquire(), so calling
blocking_lock() on the tokio Mutex panics with 'Cannot block the
current thread from within a runtime'. Switch to try_lock() and
skip SRS eviction candidates when the mutex is held — the acquire
loop retries so they'll be caught on the next iteration.
Adds a StatusTracker that records pipeline, GPU worker, and memory state
as proof jobs flow through the engine. A minimal raw-TCP HTTP/1.1 server
on a configurable port (daemon.status_listen) serves GET /status returning
JSON snapshots at 500ms polling granularity.

Tracks per-partition lifecycle (synthesizing/synth_done/gpu/done/failed),
GPU worker busy/idle state, memory budget usage, SRS/PCE allocations,
buffer flight counters, and aggregate completion stats. Completed jobs
are garbage-collected after 30 seconds.
Adds a real-time cuzk engine status visualization to the vast-manager UI.

Backend: /api/cuzk-status/{uuid} endpoint that SSH-tunnels to the remote
instance (via ControlMaster for connection reuse) and proxies the cuzk
/status JSON response.

Frontend: 1.5s-interval polling panel showing memory budget gauge,
synthesis concurrency, completion counters, per-partition pipeline
waterfall with state-colored cells (pending/synthesizing/synth_done/
gpu/done/failed) and timing, GPU worker cards, SRS/PCE allocation
list, and buffer flight counters. Cached last response avoids flash
on dashboard refresh cycles.
…ation

status.rs: partition_gpu_end now only clears a worker's busy state if
the worker is still assigned to the same job+partition. With split GPU
proving, the finalizer task may complete after the worker already picked
up a new job — the unconditional clear was clobbering the new state,
making workers appear permanently idle.

ui.html: show 16 chars of job_id instead of 8 to avoid cutting off
at the prefix boundary (e.g. "ps-snap-" was meaningless).
Replace per-partition tokio::spawn with a shared mpsc channel and
synthesis worker pool that processes partitions in FIFO arrival order.
This ensures earlier jobs' partitions are synthesized and GPU-proved
before later jobs, preventing all pipelines from stalling together.

The synthesis worker pool is unified — handles both PoRep and SnapDeals
via ParsedProofInput match, eliminating ~300 lines of duplicated
synthesis/error-handling logic.

Also compute synthesis.max_concurrent dynamically from the memory budget
(total_bytes / smallest_partition_size) instead of using the static
synthesis_concurrency config value, which was misleading (showed 4
when actual budget-gated concurrency could be 44).

Tested: 2 concurrent PoRep C2 proofs completed (0.485 proofs/min),
followed by live SnapDeals processing with correct budget gating.
Replace FIFO mpsc channels with BTreeMap-based priority queues for
both the synthesis work queue and the GPU proving queue. Items are
keyed on (job_seq, partition_idx) where job_seq is a monotonically
increasing counter assigned at pipeline dispatch time.

This ensures both synthesis workers and GPU workers always pick the
lowest partition in the oldest pipeline, completing jobs sequentially
rather than interleaving partitions randomly across pipelines.

Before: all partitions from concurrent jobs raced on Notify-based
budget acquire, causing random GPU assignment (e.g., Job A P0 and
Job B P5 on GPU simultaneously). Result: all pipelines stalled
together at 0.485 proofs/min.

After: Job A completes fully before Job B starts GPU work. Measured
0.602 proofs/min (24% improvement) — Job A finishes in 114s without
contention, vs ~245s when interleaved.
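
The ordering mechanism in miniature: a BTreeMap keyed on (job_seq, partition_idx) always hands out the lowest partition of the oldest job first:

use std::collections::BTreeMap;

fn main() {
    let mut queue: BTreeMap<(u64, u32), &str> = BTreeMap::new();
    // Two interleaved jobs arrive out of order.
    queue.insert((2, 0), "job B / partition 0");
    queue.insert((1, 5), "job A / partition 5");
    queue.insert((1, 0), "job A / partition 0");

    // pop_first() yields keys in ascending order: oldest job, lowest partition.
    while let Some(((job_seq, part), item)) = queue.pop_first() {
        println!("dispatch job_seq={job_seq} partition={part}: {item}");
    }
    // Order: A/0, A/5, B/0 — job A completes before job B starts GPU work.
}
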
Split the synthesis worker pool into two stages:
1. A single dispatcher task that serializes pop + budget acquire
2. Worker pool that receives (item, reservation) via bounded channel

Previously, N workers each popped an item from the priority queue
then raced for budget — budget went to whichever worker's acquire()
completed first, causing out-of-order synthesis (e.g., P4 before P0).

Now the single dispatcher ensures budget is allocated in strict
(job_seq, partition_idx) order: lowest partition in oldest pipeline
always gets budget first. The bounded channel preserves this order
into the worker pool.
Add support for CUDA-pinned memory backing in ProvingAssignment to
enable fast H2D transfers at PCIe line rate (~50 GB/s) instead of
going through CUDA's internal bounce buffer (~1-4 GB/s).

- PinnedBacking struct tracks raw pinned ptrs and pool return callback
- PinnedReturnFn type for returning buffers to a pool on release
- new_with_pinned() constructor creates a/b/c via Vec::from_raw_parts
- release_abc() mem::forgets pinned Vecs and calls return callback
- Drop impl ensures pinned buffers are returned even without explicit release
- prove_start uses release_abc() instead of manual Vec replacement
- synthesize_circuits_batch_with_prover_factory() accepts closure for
  custom prover creation, enabling callers to inject pinned-backed provers
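
A compacted model of the PinnedBacking contract, with an ordinary heap allocation standing in for cudaHostAlloc memory (names follow the commit; details are illustrative):

use std::mem;

type PinnedReturnFn = Box<dyn FnOnce(*mut u64, usize)>;

struct PinnedBacking {
    ptr: *mut u64,
    cap: usize,
    ret: Option<PinnedReturnFn>, // pool return callback
}

fn main() {
    // Stand-in "pinned" buffer; the real pool would own a cudaHostAlloc region.
    let buf: Box<[u64]> = vec![0u64; 1024].into_boxed_slice();
    let cap = buf.len();
    let ptr = Box::into_raw(buf) as *mut u64;

    let mut backing = PinnedBacking {
        ptr,
        cap,
        ret: Some(Box::new(|p, c| unsafe {
            // Hand the buffer back (here: just free the stand-in allocation).
            drop(Box::from_raw(std::ptr::slice_from_raw_parts_mut(p, c)));
        })),
    };

    // new_with_pinned(): a Vec view over externally owned memory (len grows
    // during synthesis; capacity is fixed by the pinned buffer).
    let a: Vec<u64> = unsafe { Vec::from_raw_parts(backing.ptr, 0, backing.cap) };

    // release_abc(): never let Vec's allocator free pinned memory; forget the
    // Vec and return the raw buffer to the pool instead.
    mem::forget(a);
    (backing.ret.take().unwrap())(backing.ptr, backing.cap);
}
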
Fix GPU underutilization caused by slow H2D PCIe transfers from unpinned
host memory. CUDA's cudaMemcpyAsync from unpinned memory goes through a
small internal bounce buffer at ~1-4 GB/s instead of PCIe Gen5 line rate
(~50 GB/s), causing 2-14s NTT stalls per partition.

PinnedPool (pinned_pool.rs):
- Pool of CUDA pinned buffers (cudaHostAlloc/cudaFreeHost)
- checkout/checkin with size-aware reuse (returns smallest fitting buffer)
- PinnedAbcBuffers: atomically checks out 3 buffers for a/b/c vectors
- Not budget-integrated: pinned memory replaces heap a/b/c allocations

Reactive dispatch throttle (engine.rs):
- Semaphore-based 1:1 modulation via max_gpu_queue_depth config
- Dispatcher acquires permit before starting each synthesis
- GPU finalizer releases permit after prove_finish completes
- Prevents burst dispatch that caused cudaHostAlloc serialization stalls
  and pinned pool thrashing (474 allocs / 12 reuses -> 24 allocs / 48 reuses)

Pipeline integration (pipeline.rs):
- synthesize_with_hint wires PinnedPool into prover factory closure
- Graceful fallback to unpinned on pool exhaustion with warning log

C++ timing instrumentation (groth16_cuda.cu):
- mutex_wait_ms, barrier_wait_ms, mutex_held_ms for GPU pipeline profiling

Results: NTT+H2D dropped from 2-14s to 0ms per partition, total GPU time
per partition dropped from 8-19s to ~950ms, budget freed for PCE caching.
Replace the burst-based P-controller with a continuous pacer that
regulates synthesis dispatch rate using PI control with GPU rate
feed-forward.

DispatchPacer:
- Feed-forward: EMA of GPU inter-completion interval (measured via
  atomic completion counter incremented by GPU finalizers)
- Feedback: PI correction on (target - EMA_waiting), where EMA_waiting
  is a smoothed gpu_work_queue.len()
- Conservative gains (Kp=0.1, Ki=0.008) for the 20-60s synthesis delay
- Output: dispatch interval clamped to [50ms, 60s]

Bootstrap phase:
- Dispatches target items at 200ms spacing to prime the pipeline
- Then waits for the first GPU completion to calibrate the GPU rate EMA
- Switches to PI control once calibrated

Steady state:
- Dispatches one item per timer tick at the PI-computed interval
- GPU events update pacer state but don't directly trigger dispatch
- Converges to matching GPU consumption rate with target items waiting
- Periodic status logging every 5 dispatches

This replaces the previous burst dispatch which caused:
- Pinned pool exhaustion -> cudaHostAlloc -> GPU driver serialization
- GPU timing jitter -> pacing instability -> more burst dispatch
The steady dispatch rate keeps concurrent synthesis count stable,
so the pinned pool stays within its warm allocation.
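
A condensed model of the control law using the gains and clamps quoted above (EMA smoothing factors and the rest of the structure are illustrative):

struct DispatchPacer {
    gpu_interval_ema_s: f64, // feed-forward: EMA of GPU inter-completion time
    ema_waiting: f64,        // smoothed gpu_work_queue.len()
    integral: f64,
    target: f64,             // desired items waiting at the GPU
}

impl DispatchPacer {
    fn next_interval_s(&mut self, waiting_now: f64) -> f64 {
        const KP: f64 = 0.1; // conservative gains for the 20-60s synthesis delay
        const KI: f64 = 0.008;
        self.ema_waiting = 0.8 * self.ema_waiting + 0.2 * waiting_now;
        let err = self.target - self.ema_waiting;
        self.integral += KI * err;
        // Positive error (queue under target) shortens the dispatch interval.
        let rate_mult = (1.0 + KP * err + self.integral).max(0.1);
        (self.gpu_interval_ema_s / rate_mult).clamp(0.05, 60.0)
    }
}

fn main() {
    let mut pacer = DispatchPacer {
        gpu_interval_ema_s: 3.7, // ~one partition's CUDA time, feed-forward base
        ema_waiting: 0.0,
        integral: 0.0,
        target: 3.0,
    };
    for waiting in [0.0, 1.0, 3.0, 5.0] {
        println!("waiting={waiting}: next dispatch in {:.2}s",
                 pacer.next_interval_s(waiting));
    }
}
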
…ynth cap, re-bootstrap

Major pacer redesign with four changes addressing three collapse modes:

1. GPU rate measurement: use actual GPU worker processing durations
   (atomically accumulated) instead of inter-completion intervals.
   The old approach included idle time when the queue drained,
   creating a self-reinforcing collapse. With N interleaved workers,
   effective interval = processing_time / N.

2. Remove synthesis throughput cap: the cap created a vicious cycle
   where slow dispatch → fewer concurrent synths → slower throughput
   → tighter cap → even slower dispatch. Memory budget provides the
   correct backpressure via budget.acquire() blocking.

3. Re-bootstrap on pipeline drain: when ema_waiting < 1.0, re-enter
   bootstrap to refill the pipe. Integral resets (stale from previous
   batch), GPU rate EMA preserved (hardware characteristic).

4. Slow bootstrap from 200ms to 3s initial / max(2s, gpu_eff) for
   re-bootstrap. Fast bootstrap flooded the pinned pool with
   concurrent cudaHostAlloc calls that serialized through the GPU
   driver, stalling all GPU activity.
PI tuning:
- Normalize error by target: (target - waiting) / target gives [-2, +1]
  range regardless of target value. Gains are now target-independent.
- kp 0.1 → 0.5: P does the heavy lifting on normalized error.
  Half-empty queue → 25% faster, empty → 50% faster, overfull → 50% slower.
- ki 0.008 → 0.02: gentle drift correction.
- Asymmetric integral clamp: +2.0 / -0.5 instead of ±20.0. Negative
  integral (slow down) is heavily restricted — aggressive backoff was
  draining the entire pipeline after memory ceiling slams.
- rate_mult clamp [0.1, 5.0] → [0.3, 3.0]: at most 3.3x slowdown.

Re-bootstrap fix:
- Old condition: ema_waiting < 1.0 (GPU queue low). This triggered
  re-bootstrap spam while items were still in the synthesis pipeline
  (30-60s to reach GPU queue). 42+ re-bootstraps in minutes.
- New condition: also require total_dispatched <= gpu_completions
  (pipeline truly empty — nothing in synthesis or GPU queue).
  Only re-bootstraps between actual proof batches.

Add in_flight metric to status log (dispatched - gpu_completions)
for pipeline depth visibility.
Hard cap on concurrent synthesis workers in the pipeline path.
Too many concurrent syntheses causes CPU contention and DDR5
memory bandwidth saturation, making each synthesis slower and
reducing overall throughput. 18 is a good default for 64-core
DDR5 systems.

Config: max_parallel_synthesis (default 18, 0 = use default).
Applied as: synth_worker_count = min(budget_partitions, max_parallel).
Contributor

@snadrus snadrus left a comment


The Go integration works, but it depends on the admin running CuZK ahead of time and on the same server (otherwise scheduling, which still runs, goes sideways).
Can we limit addresses to localhost?
