
feat(cuda): add CUDA IPC tensor transport + Python data-scientist ergonomics#127

Open
YuanYuYuan wants to merge 20 commits into main from dev/torch

Conversation

@YuanYuYuan
Collaborator

Dependency

Depends on eclipse-zenoh/zenoh#2463 — CUDA IPC transport (ZSliceKind::CudaPtr, zenoh-cuda crate, zenoh-mem-transport CUDA backend). This branch patches the zenoh stack via Cargo.toml path overrides until that PR merges.

Summary

End-to-end zero-copy GPU tensor transport from Python publishers to Rust or Python subscribers, plus a cross-language ML inference demo (Python torchvision → Rust ONNX Runtime).

Key Changes

  • CUDA IPC transport: ZSliceKind::CudaPtr/CudaTensor ZSlices carry GPU memory as a 77-byte IPC handle — no CPU copy of tensor data on the wire
  • ZBuf CUDA API: ZBuf::from_cuda(), typed_cuda_slices() iterator, copy_to_host() for D→H transfer
  • Python bindings (ros-z-py): PyCudaBuf, publish_tensor(), publish_zbuf(), recv_raw_view(), as_torch(), as_numpy(), as_dlpack() — full PyO3 + pyi stubs
  • CPU tensor path: publish_tensor() also handles CPU tensors and NumPy arrays via NPY wire format
  • ORT demo: ort_publisher.py (Python/GPU) + ort_classifier.rs (Rust/ONNX Runtime) — SqueezeNet 1.1 inference over CUDA IPC
  • Nix: CUDA dev shells with correct LD_LIBRARY_PATH (cudart + host driver + gcc libstdc++ for pip torch)
  • Book: new gpu-tensor-transport chapter covering architecture, API, lifetime contract, and the ORT demo
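
The CPU tensor path above serializes NumPy arrays in the NPY wire format, which the subscriber then decodes from the raw payload bytes. A minimal sketch of that round-trip with plain NumPy (the `encode_npy`/`decode_npy` helpers are illustrative, not the actual ros-z-py API):

```python
import io

import numpy as np


def encode_npy(arr: np.ndarray) -> bytes:
    """Serialize an array to NPY bytes, as the CPU path of publish_tensor() does."""
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()


def decode_npy(payload: bytes) -> np.ndarray:
    """Decode NPY bytes back to an ndarray, as as_numpy() does for CPU payloads."""
    return np.load(io.BytesIO(payload))


arr = np.arange(12, dtype=np.float32).reshape(3, 4)
out = decode_npy(encode_npy(arr))
assert out.shape == (3, 4) and out.dtype == np.float32
assert np.array_equal(out, arr)
```

The NPY header carries shape and dtype, so no out-of-band metadata is needed on this path.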

Breaking Changes

None

@YuanYuYuan YuanYuYuan changed the title feat(cuda): CUDA IPC tensor transport + Python data-scientist ergonomics feat(cuda): add CUDA IPC tensor transport + Python data-scientist ergonomics Mar 9, 2026
@github-actions

github-actions bot commented Mar 9, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://ZettaScaleLabs.github.io/ros-z/pr-preview/pr-127/

Built to branch gh-pages at 2026-03-13 14:24 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

- Patch zenoh-buffers to use feat/cuda-zslice (ZSliceKind::CudaPtr)
- Generalize ShmConfig<B: ShmProviderBackend> with default POSIX backend
- Add cuda ZBuf serde fast path: CudaPtr slices bypass pointer comparison
- Add ros-z-py CUDA methods: is_cuda getter and as_dlpack() DLPack capsule

Adds cuda and ros-{distro}-cuda dev shells that set CUDA_PATH and
LD_LIBRARY_PATH so the zenoh-cuda build.rs can find libcudart on NixOS.
Also allows unfree packages (required for the CUDA runtime).

Patch all zenoh crates from the feat/cuda-zslice worktree to unify the
dependency graph when --features cuda is active. Feature unification
causes zenoh-buffers/cuda to propagate to zenoh-codec, which requires
the local codec arm for ZSliceKind::CudaPtr. Patching only zenoh-buffers
left non-exhaustive matches; each additional crate was needed to resolve
type identity conflicts in the transport/shm/core layer.

Propagate ros-z-py/cuda → ros-z/cuda so zenoh-codec/cuda is activated.
Remove spurious SplitBuffer imports in payload_view.rs.

Add /run/opengl-driver/lib to LD_LIBRARY_PATH so libcuda.so (the host
driver) is found at runtime. Add /run/current-system/sw/bin to PATH so
nvidia-smi is accessible from within nix develop .#cuda.

build.rs looks in $CUDA_PATH/lib for libcudart.so. cuda_cudart (dev)
only has headers; cuda_cudart.lib has the actual shared library.

…b.py

Two bugs prevented the two-process CUDA IPC example from working:

1. ZContextBuilder::with_shm_enabled() created a ros-z SHM provider
   but never set transport/shared_memory/enabled=true in the zenoh
   config.  common_overrides() forces it to false, so ShmContext::new
   returned None and ext_shm was absent from InitSyn/InitAck.
   Fix: chain .with_json("transport/shared_memory/enabled", json!(true)).

2. The cudaIpcOpenMemHandle FFI (in zenoh-cuda) had the wrong calling
   convention (pointer instead of by-value struct) — fixed in the
   zenoh feat/cuda-zslice patch; this commit updates Cargo.lock.

cuda_pubsub.py: add with_logging_enabled(), a time.sleep() both for
subscription propagation and to keep the IPC handle alive until the
subscriber maps it, and a --timeout CLI argument.

The DLPack deleter must free the DLManagedTensor itself, per spec.
The previous code left that to the capsule destructor, but the capsule
destructor also found the 'used_dltensor' pointer and freed it again
→ double free or corruption (fasttop) when torch.from_dlpack() ran.

Fix: dlpack_deleter now frees ctx + DLManagedTensor; the capsule
destructor only handles the uncollected case ('dltensor' name).
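
The ownership protocol behind this fix follows the DLPack capsule convention: the producer names the capsule 'dltensor'; a consumer that takes ownership renames it 'used_dltensor' and becomes responsible for calling the deleter, while the capsule destructor frees only a still-unconsumed 'dltensor'. A pure-Python model of that hand-off (stand-ins for the real C capsule and deleter, not the ros-z-py code):

```python
class FakeCapsule:
    """Models a PyCapsule carrying a DLManagedTensor pointer."""
    def __init__(self, deleter):
        self.name = "dltensor"   # 'dltensor' = not yet consumed
        self.deleter = deleter
        self.freed = 0


def consume(capsule):
    """Models torch.from_dlpack: take ownership, rename, call the deleter."""
    assert capsule.name == "dltensor", "capsule already consumed"
    capsule.name = "used_dltensor"
    capsule.deleter(capsule)     # deleter frees ctx + DLManagedTensor


def capsule_destructor(capsule):
    """Models the PyCapsule destructor: free only the uncollected case.
    The old bug freed the 'used_dltensor' case here too -> double free."""
    if capsule.name == "dltensor":
        capsule.deleter(capsule)


def deleter(capsule):
    capsule.freed += 1           # stand-in for free(ctx); free(managed_tensor)


cap = FakeCapsule(deleter)
consume(cap)                     # subscriber materializes the tensor
capsule_destructor(cap)          # capsule later garbage-collected
assert cap.freed == 1            # freed exactly once, no double free

orphan = FakeCapsule(deleter)    # never consumed: destructor must clean up
capsule_destructor(orphan)
assert orphan.freed == 1
```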

cuda_pubsub.py: add a --warmup flag (default 1 s) so callers can
increase it when the subscriber needs time for torch CUDA init.
Use --warmup 5 when running with --torch on first launch.

…nsor ZSlice

Add DLPack shape/dtype metadata round-trip so subscribers receive correctly
shaped tensors without out-of-band conventions:

- zenoh-buffers: ZSliceKind::CudaTensor (kind=3) for typed tensor slices
- zenoh-cuda: TensorMeta struct (ndim, shape, dtype, strides, byte_offset)
- zenoh-cuda: CudaBufInner::with_tensor_meta(), tensor_meta() accessor
- ros-z zbuf: ZBuf::from_cuda_tensor(), typed_cuda_slices() iterator
- ros-z zbuf: CudaPtr|CudaTensor fast path in visit_borrowed_bytes
- ros-z-py cuda_buf: PyCudaBuf::from_torch() — zero-copy torch tensor pub
- ros-z-py cuda_buf: with_tensor_meta(), parse_torch_dtype, c_contiguous_strides
- ros-z-py cuda_buf: into_zbuf() routes to from_cuda_tensor when meta present
- ros-z-py payload_view: is_cuda() and as_dlpack() handle CudaTensor kind
- ros-z-py payload_view: as_dlpack() uses TensorMeta for shape/dtype/strides
- cuda_pubsub.py: --torch mode uses from_torch(); sub verifies shape+dtype
- cuda_transport.rs: fix unused_mut warning
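
The c_contiguous_strides helper listed above can be sketched as follows: for a C-contiguous (row-major) tensor, the stride of each dimension is the product of all trailing dimension sizes, in elements, which is what DLPack expects. An illustrative reimplementation, not the ros-z-py source:

```python
def c_contiguous_strides(shape):
    """Element strides for a C-contiguous (row-major) tensor of this shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


# e.g. the [1, 3, 224, 224] f32 tensor from the ORT demo:
assert c_contiguous_strides([1, 3, 224, 224]) == [150528, 50176, 224, 1]
assert c_contiguous_strides([5]) == [1]
```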

…sh_tensor, pyi stubs

High-level API so users never touch DLPack, CudaZBuf, or wire formats:

  pub.publish_tensor(tensor)   # torch/numpy, cuda or cpu
  tensor = view.as_torch()     # CUDA (dlpack) or CPU (npy decode)
  arr    = view.as_numpy()     # same, returns ndarray

Changes:
- ZPayloadView.as_torch(): CUDA via from_dlpack; CPU via numpy NPY decode; fallback uint8
- ZPayloadView.as_numpy(): CUDA via torch->cpu->numpy; CPU via numpy NPY decode
- ZPayloadView.raw_bytes(): expose raw payload bytes (used by CPU decoder)
- ZPublisher.publish_tensor(): dispatches on device; CUDA uses from_torch +
  publish_zbuf with auto-keepalive; CPU serializes via numpy.save (NPY format)
- ZPublisher.publish_bytes(): raw bytes publish path
- CudaZBuf: keepalive: Option<PyObject> field; into_zbuf(keepalive=None)
- PyCudaBuf.from_torch / into_zbuf: pub for Rust callers (publish_tensor)
- ros_z_py.pyi: full type stubs for PyCudaBuf, CudaZBuf, ZPayloadView,
  ZPublisher, ZSubscriber, ZContext/ZNode
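
The device dispatch in publish_tensor() can be modeled with duck typing: torch CUDA tensors expose is_cuda, so anything on the GPU takes the zero-copy DLPack/IPC path and everything else falls back to NPY serialization. A schematic only (the dummy tensor and publish callbacks are placeholders, not the actual ros-z-py API):

```python
def publish_tensor(tensor, publish_cuda, publish_npy):
    """Route GPU tensors to the zero-copy path, CPU data to NPY bytes."""
    if getattr(tensor, "is_cuda", False):
        return publish_cuda(tensor)   # from_torch() + publish_zbuf w/ keepalive
    return publish_npy(tensor)        # numpy.save (NPY) wire format


class DummyCudaTensor:
    is_cuda = True                    # mimics torch.Tensor.is_cuda


taken = []
publish_tensor(DummyCudaTensor(),
               lambda t: taken.append("cuda"),
               lambda t: taken.append("npy"))
publish_tensor([1, 2, 3],            # plain CPU data, no is_cuda attribute
               lambda t: taken.append("cuda"),
               lambda t: taken.append("npy"))
assert taken == ["cuda", "npy"]
```

Dispatching on an attribute rather than an isinstance check keeps the publisher free of a hard torch import for CPU-only users.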

Regenerated after the zenoh patch branch gained the zenoh-mem-transport
crate and the ext_mem (0x8) capability-negotiation extension.

Add ort_publisher.py (Python/GPU) + ort_classifier.rs (Rust/ORT) as a
complete end-to-end example: torchvision preprocessing on GPU → CUDA IPC
→ SqueezeNet 1.1 inference via the ONNX Runtime CUDAExecutionProvider.

- examples/ort_publisher.py: publish [1,3,224,224] f32 tensor via publish_tensor()
- examples/ort_classifier.rs: recv_serialized_timeout → typed_cuda_slices
  → copy_to_host → ort Session → top-1 ImageNet label; ort/ndarray gated
  behind optional cuda feature to avoid network download in CI
- flake.nix: add gcc libstdc++ to cudaShellHook LD_LIBRARY_PATH so
  pip-installed torch finds libstdc++.so.6 without manual env hacks
- book: document the pipeline, setup steps, and asset download commands
- .gitignore: exclude *.onnx, imagenet_classes.txt, test_image.jpg
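
After copy_to_host, the classifier reduces the SqueezeNet logits to a top-1 ImageNet label. That last step is just a softmax plus argmax, sketched here with NumPy (the label list and logits are made up for illustration; the demo uses the full 1000-class imagenet_classes.txt):

```python
import numpy as np


def top1(logits: np.ndarray, labels: list) -> tuple:
    """Return the highest-probability label and its softmax confidence."""
    z = logits - logits.max()            # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    i = int(probs.argmax())
    return labels[i], float(probs[i])


labels = ["tabby cat", "golden retriever", "espresso"]
label, conf = top1(np.array([2.0, 5.0, 0.5]), labels)
assert label == "golden retriever" and conf > 0.9
```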

…mem-transport

Fixes CI: the local worktree paths don't exist on CI runners. Also fix
serialize_to_shm<B> impls to match the now-generic trait signature from
upstream main (ShmProviderBackend type parameter).

…t signature

The upstream main generalized serialize_to_shm to take B: ShmProviderBackend.
Fix the rmw-zenoh-rs impl (missed in the previous commit) and switch
Cargo.toml patches back to git refs for CI.

test_typed_tensor_zbuf: verifies TensorMeta (shape, dtype, strides)
survives ZBuf wrapping and is returned by typed_cuda_slices(); also
confirms CudaPtr without metadata does not appear there.

test_native_ipc_handle_non_zero: verifies alloc_device produces a
non-zero cudaIpcGetMemHandle that survives ZBuf wrapping — the handle
the subscriber will open. Full two-process IPC is covered by the ORT
cross-language demo.

FlakeHub Cache requires a paid subscription ($20/member/month) since
the free Magic Nix Cache tier was discontinued in Feb 2025.
All 7 flakehub-cache-action steps removed from ci.yml, docs.yml, and
mdbook-preview.yml. Cachix (ros.cachix.org) remains the Nix binary
cache for ROS jobs. id-token: write removed from jobs that no longer
need it.