feat(cuda): add CUDA IPC tensor transport + Python data-scientist ergonomics #127

Open — YuanYuYuan wants to merge 20 commits into main
Conversation
- Patch zenoh-buffers to use feat/cuda-zslice (ZSliceKind::CudaPtr)
- Generalize ShmConfig<B: ShmProviderBackend> with default POSIX backend
- Add CUDA ZBuf serde fast path: CudaPtr slices bypass pointer comparison
- Add ros-z-py CUDA methods: is_cuda getter and as_dlpack() DLPack capsule
Adds cuda and ros-{distro}-cuda dev shells that set CUDA_PATH and
LD_LIBRARY_PATH so the zenoh-cuda build.rs can find libcudart on NixOS.
Also allows unfree packages (required for CUDA runtime).
Patch all zenoh crates from the feat/cuda-zslice worktree to unify the dependency graph when --features cuda is active. Feature unification causes zenoh-buffers/cuda to propagate to zenoh-codec, which requires the local codec arm for ZSliceKind::CudaPtr. Patching only zenoh-buffers left non-exhaustive matches; each additional crate was needed to resolve type-identity conflicts in the transport/shm/core layer. Propagate ros-z-py/cuda → ros-z/cuda so zenoh-codec/cuda is activated. Remove spurious SplitBuffer imports in payload_view.rs.
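A `[patch]` section of roughly this shape accomplishes the unification described above (the crate names follow the commit; the worktree path and exact crate list are illustrative, not taken from the PR):

```toml
# Cargo.toml — illustrative sketch; the real override lists every zenoh
# crate that participates in the graph, all pointing at the same worktree
[patch."https://github.com/eclipse-zenoh/zenoh"]
zenoh-buffers = { path = "../zenoh-cuda-worktree/commons/zenoh-buffers" }
zenoh-codec   = { path = "../zenoh-cuda-worktree/commons/zenoh-codec" }
# ...plus the transport/shm/core crates, to avoid type-identity conflicts
```

Patching every crate from one source tree ensures Cargo resolves a single copy of each type, so `ZSliceKind::CudaPtr` values constructed in one layer match exhaustively in another.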
Add /run/opengl-driver/lib to LD_LIBRARY_PATH so libcuda.so (the host driver) is found at runtime. Add /run/current-system/sw/bin to PATH so nvidia-smi is accessible from within nix develop .#cuda.
build.rs looks in $CUDA_PATH/lib for libcudart.so. cuda_cudart (dev) only has headers; cuda_cudart.lib has the actual shared library.
…b.py
Two bugs prevented the two-process CUDA IPC example from working:
1. ZContextBuilder::with_shm_enabled() created a ros-z SHM provider
but never set transport/shared_memory/enabled=true in the zenoh
config. common_overrides() forces it to false, so ShmContext::new
returned None and ext_shm was absent from InitSyn/InitAck.
Fix: chain .with_json("transport/shared_memory/enabled", json!(true)).
2. The cudaIpcOpenMemHandle FFI (in zenoh-cuda) had the wrong calling
convention (pointer instead of by-value struct) — fixed in the
zenoh feat/cuda-zslice patch; this commit updates Cargo.lock.
cuda_pubsub.py: add with_logging_enabled(), time.sleep() for
subscription propagation and to keep the IPC handle alive until
the subscriber maps it, and --timeout CLI argument.
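Fix 1 above amounts to setting one boolean in the zenoh config; a sketch of the resulting fragment, reconstructed from the key path the commit cites:

```json
{
  "transport": {
    "shared_memory": { "enabled": true }
  }
}
```

Without this flag, creating an SHM provider on the ros-z side is not enough: the transport never advertises `ext_shm` during the InitSyn/InitAck handshake, so peers fall back to the non-shared-memory path.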
DLPack deleter must free the DLManagedTensor itself per spec.
Previous code left it for the capsule destructor, but the capsule
destructor also found the 'used_dltensor' pointer and freed it again
→ double free or corruption (fasttop) when torch.from_dlpack() ran.
Fix: dlpack_deleter now frees ctx + DLManagedTensor; capsule
destructor only handles the uncollected case ('dltensor' name).
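The ownership handoff the fix relies on is the DLPack capsule-name protocol: a consumer renames the capsule from 'dltensor' to 'used_dltensor' when it takes the tensor, so a destructor that frees only capsules still named 'dltensor' cannot double-free. A minimal sketch of just the name protocol, exercised from plain CPython via ctypes (no GPU memory or DLPack structs involved):

```python
import ctypes

api = ctypes.pythonapi
api.PyCapsule_New.restype = ctypes.py_object
api.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
api.PyCapsule_GetName.restype = ctypes.c_char_p
api.PyCapsule_GetName.argtypes = [ctypes.py_object]
api.PyCapsule_SetName.restype = ctypes.c_int
api.PyCapsule_SetName.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Keep the name bytes alive: PyCapsule stores the pointer, not a copy.
NEW, USED = b"dltensor", b"used_dltensor"

cap = api.PyCapsule_New(ctypes.c_void_p(0xDEAD), NEW, None)
assert api.PyCapsule_GetName(cap) == b"dltensor"

# A consumer (e.g. torch.from_dlpack) takes ownership by renaming the capsule.
api.PyCapsule_SetName(cap, USED)

# The producer's "uncollected capsule" check now fails, so it must not free.
assert api.PyCapsule_GetName(cap) == b"used_dltensor"
```

Once renamed, freeing the DLManagedTensor is entirely the consumer's job via the tensor's own deleter, which is why the deleter itself must free the struct per the spec.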
cuda_pubsub.py: add --warmup flag (default 1s) so callers can
increase it when subscriber needs time for torch CUDA init.
Use --warmup 5 when running with --torch on first launch.
…nsor ZSlice

Add DLPack shape/dtype metadata round-trip so subscribers receive correctly shaped tensors without out-of-band conventions:
- zenoh-buffers: ZSliceKind::CudaTensor (kind=3) for typed tensor slices
- zenoh-cuda: TensorMeta struct (ndim, shape, dtype, strides, byte_offset)
- zenoh-cuda: CudaBufInner::with_tensor_meta(), tensor_meta() accessor
- ros-z zbuf: ZBuf::from_cuda_tensor(), typed_cuda_slices() iterator
- ros-z zbuf: CudaPtr|CudaTensor fast path in visit_borrowed_bytes
- ros-z-py cuda_buf: PyCudaBuf::from_torch() — zero-copy torch tensor pub
- ros-z-py cuda_buf: with_tensor_meta(), parse_torch_dtype, c_contiguous_strides
- ros-z-py cuda_buf: into_zbuf() routes to from_cuda_tensor when meta present
- ros-z-py payload_view: is_cuda() and as_dlpack() handle CudaTensor kind
- ros-z-py payload_view: as_dlpack() uses TensorMeta for shape/dtype/strides
- cuda_pubsub.py: --torch mode uses from_torch(); sub verifies shape+dtype
- cuda_transport.rs: fix unused_mut warning
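The stride helper named in the commit can be sketched as follows; the function name comes from the commit, but this element-stride (DLPack-style, row-major) implementation is an assumption:

```python
def c_contiguous_strides(shape: list[int]) -> list[int]:
    """Row-major strides in elements, DLPack-style: the last axis has stride 1."""
    strides = [0] * len(shape)
    acc = 1
    for i in range(len(shape) - 1, -1, -1):
        strides[i] = acc
        acc *= shape[i]
    return strides

# A [1, 3, 224, 224] image tensor, as published by the ORT example:
assert c_contiguous_strides([1, 3, 224, 224]) == [150528, 50176, 224, 1]
```

Carrying explicit strides in TensorMeta lets a subscriber reconstruct the tensor even when the producer's layout is assumed rather than negotiated.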
…sh_tensor, pyi stubs

High-level API so users never touch DLPack, CudaZBuf, or wire formats:

    pub.publish_tensor(tensor)   # torch/numpy, cuda or cpu
    tensor = view.as_torch()     # CUDA (dlpack) or CPU (npy decode)
    arr = view.as_numpy()        # same, returns ndarray

Changes:
- ZPayloadView.as_torch(): CUDA via from_dlpack; CPU via numpy NPY decode; fallback uint8
- ZPayloadView.as_numpy(): CUDA via torch->cpu->numpy; CPU via numpy NPY decode
- ZPayloadView.raw_bytes(): expose raw payload bytes (used by CPU decoder)
- ZPublisher.publish_tensor(): dispatches on device; CUDA uses from_torch + publish_zbuf with auto-keepalive; CPU serializes via numpy.save (NPY format)
- ZPublisher.publish_bytes(): raw bytes publish path
- CudaZBuf: keepalive: Option<PyObject> field; into_zbuf(keepalive=None)
- PyCudaBuf.from_torch / into_zbuf: pub for Rust callers (publish_tensor)
- ros_z_py.pyi: full type stubs for PyCudaBuf, CudaZBuf, ZPayloadView, ZPublisher, ZSubscriber, ZContext/ZNode
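The CPU fallback path described above serializes via NumPy's NPY container; a minimal sketch of that round trip using plain numpy, with no ros-z-py involved:

```python
import io

import numpy as np


def encode_npy(arr: np.ndarray) -> bytes:
    # np.save writes the NPY header (magic, dtype, shape) followed by raw data,
    # so shape/dtype travel in-band — the same property TensorMeta gives CUDA slices.
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()


def decode_npy(payload: bytes) -> np.ndarray:
    return np.load(io.BytesIO(payload))


arr = np.arange(6, dtype=np.float32).reshape(2, 3)
out = decode_npy(encode_npy(arr))
assert out.shape == (2, 3) and out.dtype == np.float32
assert (out == arr).all()
```

Using NPY for the CPU path keeps both wire formats self-describing, so `as_numpy()` needs no out-of-band shape or dtype convention on either device.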
…ish_tensor/as_torch
Regenerated after zenoh patch branch gained zenoh-mem-transport crate and ext_mem (0x8) capability negotiation extension.
Add ort_publisher.py (Python/GPU) + ort_classifier.rs (Rust/ORT) as a complete end-to-end example: torchvision preprocessing on GPU → CUDA IPC → SqueezeNet 1.1 inference via ONNX Runtime CUDAExecutionProvider.
- examples/ort_publisher.py: publish [1,3,224,224] f32 tensor via publish_tensor()
- examples/ort_classifier.rs: recv_serialized_timeout → typed_cuda_slices → copy_to_host → ort Session → top-1 ImageNet label; ort/ndarray gated behind optional cuda feature to avoid network download in CI
- flake.nix: add gcc libstdc++ to cudaShellHook LD_LIBRARY_PATH so pip-installed torch finds libstdc++.so.6 without manual env hacks
- book: document the pipeline, setup steps, and asset download commands
- .gitignore: exclude *.onnx, imagenet_classes.txt, test_image.jpg
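The classifier's final step — mapping the model's logits to a top-1 ImageNet label — is independent of ORT and can be sketched on its own (the labels below are illustrative, not the real class list):

```python
def top1(logits: list[float], labels: list[str]) -> tuple[str, float]:
    """Return the label and score of the highest logit (top-1 selection)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return labels[best], logits[best]


# Illustrative three-class slice of an ImageNet-style label list.
labels = ["tench", "goldfish", "great white shark"]
label, score = top1([0.1, 2.7, 0.4], labels)
assert label == "goldfish"
```

The Rust side does the same after `copy_to_host`: run the session, scan the output tensor for the max logit, and index into imagenet_classes.txt.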
…mem-transport

Fixes CI: the local worktree paths don't exist on CI runners. Also fix serialize_to_shm<B> impls to match the now-generic trait signature from upstream main (ShmProviderBackend type parameter).
…t signature

The upstream main generalized serialize_to_shm to take B: ShmProviderBackend. Fix the rmw-zenoh-rs impl (missed in the previous commit) and switch Cargo.toml patches back to git refs for CI.
- test_typed_tensor_zbuf: verifies TensorMeta (shape, dtype, strides) survives ZBuf wrapping and is returned by typed_cuda_slices(); also confirms CudaPtr without metadata does not appear there.
- test_native_ipc_handle_non_zero: verifies alloc_device produces a non-zero cudaIpcGetMemHandle that survives ZBuf wrapping — the handle the subscriber will open.

Full two-process IPC is covered by the ORT cross-language demo.
FlakeHub Cache requires a paid subscription ($20/member/month) since the free Magic Nix Cache tier was discontinued in Feb 2025. All 7 flakehub-cache-action steps removed from ci.yml, docs.yml, and mdbook-preview.yml. Cachix (ros.cachix.org) remains the Nix binary cache for ROS jobs. id-token: write removed from jobs that no longer need it.
Dependency
Depends on eclipse-zenoh/zenoh#2463 — CUDA IPC transport (ZSliceKind::CudaPtr, the zenoh-cuda crate, and the zenoh-mem-transport CUDA backend). This branch patches the zenoh stack via Cargo.toml path overrides until that PR merges.

Summary
End-to-end zero-copy GPU tensor transport from Python publishers to Rust or Python subscribers, plus a cross-language ML inference demo (Python torchvision → Rust ONNX Runtime).
Key Changes
- ZSliceKind::CudaPtr/CudaTensor: ZSlices carry GPU memory as a 77-byte IPC handle — no CPU copy of tensor data on the wire
- ZBuf CUDA API: ZBuf::from_cuda(), typed_cuda_slices() iterator, copy_to_host() for D→H transfer
- Python bindings (ros-z-py): PyCudaBuf, publish_tensor(), publish_zbuf(), recv_raw_view(), as_torch(), as_numpy(), as_dlpack() — full PyO3 + pyi stubs
- publish_tensor() also handles CPU tensors and NumPy arrays via the NPY wire format
- Examples: ort_publisher.py (Python/GPU) + ort_classifier.rs (Rust/ONNX Runtime) — SqueezeNet 1.1 inference over CUDA IPC
- Nix dev shells set LD_LIBRARY_PATH (cudart + host driver + gcc libstdc++ for pip torch)
- Book: gpu-tensor-transport chapter covering architecture, API, lifetime contract, and the ORT demo

Breaking Changes
None