
feat(cuda): add CUDA IPC tensor transport + Python data-scientist ergonomics#127

Open
YuanYuYuan wants to merge 20 commits into main from dev/torch

Conversation

@YuanYuYuan
Collaborator

Dependency

Depends on eclipse-zenoh/zenoh#2463 — CUDA IPC transport (ZSliceKind::CudaPtr, zenoh-cuda crate, zenoh-mem-transport CUDA backend). This branch patches the zenoh stack via Cargo.toml path overrides until that PR merges.

Summary

End-to-end zero-copy GPU tensor transport from Python publishers to Rust or Python subscribers, plus a cross-language ML inference demo (Python torchvision → Rust ONNX Runtime).

Key Changes

  • CUDA IPC transport: ZSliceKind::CudaPtr/CudaTensor ZSlices carry GPU memory as a 77-byte IPC handle — no CPU copy of tensor data on the wire
  • ZBuf CUDA API: ZBuf::from_cuda(), typed_cuda_slices() iterator, copy_to_host() for D→H transfer
  • Python bindings (ros-z-py): PyCudaBuf, publish_tensor(), publish_zbuf(), recv_raw_view(), as_torch(), as_numpy(), as_dlpack() — full PyO3 + pyi stubs
  • CPU tensor path: publish_tensor() also handles CPU tensors and NumPy arrays via NPY wire format
  • ORT demo: ort_publisher.py (Python/GPU) + ort_classifier.rs (Rust/ONNX Runtime) — SqueezeNet 1.1 inference over CUDA IPC
  • Nix: CUDA dev shells with correct LD_LIBRARY_PATH (cudart + host driver + gcc libstdc++ for pip torch)
  • Book: new gpu-tensor-transport chapter covering architecture, API, lifetime contract, and the ORT demo
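
The CPU tensor path above serializes NumPy arrays in the NPY wire format, which the subscriber then decodes from the raw payload bytes. A minimal sketch of that round-trip with plain NumPy (the `encode_npy`/`decode_npy` helpers are illustrative, not the actual ros-z-py API):

```python
import io

import numpy as np


def encode_npy(arr: np.ndarray) -> bytes:
    """Serialize an array to NPY bytes, as the CPU path of publish_tensor() does."""
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()


def decode_npy(payload: bytes) -> np.ndarray:
    """Decode NPY bytes back to an ndarray, as as_numpy() does for CPU payloads."""
    return np.load(io.BytesIO(payload))


arr = np.arange(12, dtype=np.float32).reshape(3, 4)
out = decode_npy(encode_npy(arr))
assert out.shape == (3, 4) and out.dtype == np.float32
assert np.array_equal(out, arr)
```

The NPY header carries shape and dtype, so no out-of-band metadata is needed on this path.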

Breaking Changes

None

@YuanYuYuan YuanYuYuan changed the title feat(cuda): CUDA IPC tensor transport + Python data-scientist ergonomics feat(cuda): add CUDA IPC tensor transport + Python data-scientist ergonomics Mar 9, 2026
@github-actions

github-actions bot commented Mar 9, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://ZettaScaleLabs.github.io/ros-z/pr-preview/pr-127/

Built to branch gh-pages at 2026-03-13 14:24 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

- Patch zenoh-buffers to use feat/cuda-zslice (ZSliceKind::CudaPtr)
- Generalize ShmConfig<B: ShmProviderBackend> with default POSIX backend
- Add cuda ZBuf serde fast path: CudaPtr slices bypass pointer comparison
- Add ros-z-py CUDA methods: is_cuda getter and as_dlpack() DLPack capsule

Adds cuda and ros-{distro}-cuda dev shells that set CUDA_PATH and
LD_LIBRARY_PATH so the zenoh-cuda build.rs can find libcudart on NixOS.
Also allows unfree packages (required for the CUDA runtime).

Patch all zenoh crates from the feat/cuda-zslice worktree to unify the
dependency graph when --features cuda is active. Feature unification
causes zenoh-buffers/cuda to propagate to zenoh-codec, which requires
the local codec arm for ZSliceKind::CudaPtr. Patching only zenoh-buffers
left non-exhaustive matches; each additional crate was needed to resolve
type identity conflicts in the transport/shm/core layer.

Propagate ros-z-py/cuda → ros-z/cuda so zenoh-codec/cuda is activated.
Remove spurious SplitBuffer imports in payload_view.rs.

Add /run/opengl-driver/lib to LD_LIBRARY_PATH so libcuda.so (the host
driver) is found at runtime. Add /run/current-system/sw/bin to PATH so
nvidia-smi is accessible from within nix develop .#cuda.

build.rs looks in $CUDA_PATH/lib for libcudart.so. cuda_cudart (dev)
only has headers; cuda_cudart.lib has the actual shared library.

…b.py

Two bugs prevented the two-process CUDA IPC example from working:

1. ZContextBuilder::with_shm_enabled() created a ros-z SHM provider
   but never set transport/shared_memory/enabled=true in the zenoh
   config.  common_overrides() forces it to false, so ShmContext::new
   returned None and ext_shm was absent from InitSyn/InitAck.
   Fix: chain .with_json("transport/shared_memory/enabled", json!(true)).

2. The cudaIpcOpenMemHandle FFI (in zenoh-cuda) had the wrong calling
   convention (pointer instead of by-value struct) — fixed in the
   zenoh feat/cuda-zslice patch; this commit updates Cargo.lock.

cuda_pubsub.py: add with_logging_enabled(), a time.sleep() both for
subscription propagation and to keep the IPC handle alive until the
subscriber maps it, and a --timeout CLI argument.

The DLPack deleter must free the DLManagedTensor itself, per spec.
The previous code left that to the capsule destructor, but the capsule
destructor also found the 'used_dltensor' pointer and freed it again
→ double free or corruption (fasttop) when torch.from_dlpack() ran.

Fix: dlpack_deleter now frees ctx + DLManagedTensor; the capsule
destructor only handles the uncollected case ('dltensor' name).
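
The ownership protocol behind this fix follows the DLPack capsule convention: the producer names the capsule 'dltensor'; a consumer that takes ownership renames it 'used_dltensor' and becomes responsible for calling the deleter, while the capsule destructor frees only a still-unconsumed 'dltensor'. A pure-Python model of that hand-off (stand-ins for the real C capsule and deleter, not the ros-z-py code):

```python
class FakeCapsule:
    """Models a PyCapsule carrying a DLManagedTensor pointer."""
    def __init__(self, deleter):
        self.name = "dltensor"   # 'dltensor' = not yet consumed
        self.deleter = deleter
        self.freed = 0


def consume(capsule):
    """Models torch.from_dlpack: take ownership, rename, call the deleter."""
    assert capsule.name == "dltensor", "capsule already consumed"
    capsule.name = "used_dltensor"
    capsule.deleter(capsule)     # deleter frees ctx + DLManagedTensor


def capsule_destructor(capsule):
    """Models the PyCapsule destructor: free only the uncollected case.
    The old bug freed the 'used_dltensor' case here too -> double free."""
    if capsule.name == "dltensor":
        capsule.deleter(capsule)


def deleter(capsule):
    capsule.freed += 1           # stand-in for free(ctx); free(managed_tensor)


cap = FakeCapsule(deleter)
consume(cap)                     # subscriber materializes the tensor
capsule_destructor(cap)          # capsule later garbage-collected
assert cap.freed == 1            # freed exactly once, no double free

orphan = FakeCapsule(deleter)    # never consumed: destructor must clean up
capsule_destructor(orphan)
assert orphan.freed == 1
```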

cuda_pubsub.py: add a --warmup flag (default 1 s) so callers can
increase it when the subscriber needs time for torch CUDA init.
Use --warmup 5 when running with --torch on first launch.

…nsor ZSlice

Add DLPack shape/dtype metadata round-trip so subscribers receive correctly
shaped tensors without out-of-band conventions:

- zenoh-buffers: ZSliceKind::CudaTensor (kind=3) for typed tensor slices
- zenoh-cuda: TensorMeta struct (ndim, shape, dtype, strides, byte_offset)
- zenoh-cuda: CudaBufInner::with_tensor_meta(), tensor_meta() accessor
- ros-z zbuf: ZBuf::from_cuda_tensor(), typed_cuda_slices() iterator
- ros-z zbuf: CudaPtr|CudaTensor fast path in visit_borrowed_bytes
- ros-z-py cuda_buf: PyCudaBuf::from_torch() — zero-copy torch tensor pub
- ros-z-py cuda_buf: with_tensor_meta(), parse_torch_dtype, c_contiguous_strides
- ros-z-py cuda_buf: into_zbuf() routes to from_cuda_tensor when meta present
- ros-z-py payload_view: is_cuda() and as_dlpack() handle CudaTensor kind
- ros-z-py payload_view: as_dlpack() uses TensorMeta for shape/dtype/strides
- cuda_pubsub.py: --torch mode uses from_torch(); sub verifies shape+dtype
- cuda_transport.rs: fix unused_mut warning
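
The c_contiguous_strides helper listed above can be sketched as follows: for a C-contiguous (row-major) tensor, the stride of each dimension is the product of all trailing dimension sizes, in elements, which is what DLPack expects. An illustrative reimplementation, not the ros-z-py source:

```python
def c_contiguous_strides(shape):
    """Element strides for a C-contiguous (row-major) tensor of this shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


# e.g. the [1, 3, 224, 224] f32 tensor from the ORT demo:
assert c_contiguous_strides([1, 3, 224, 224]) == [150528, 50176, 224, 1]
assert c_contiguous_strides([5]) == [1]
```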

…sh_tensor, pyi stubs

High-level API so users never touch DLPack, CudaZBuf, or wire formats:

  pub.publish_tensor(tensor)   # torch/numpy, cuda or cpu
  tensor = view.as_torch()     # CUDA (dlpack) or CPU (npy decode)
  arr    = view.as_numpy()     # same, returns ndarray

Changes:
- ZPayloadView.as_torch(): CUDA via from_dlpack; CPU via numpy NPY decode; fallback uint8
- ZPayloadView.as_numpy(): CUDA via torch->cpu->numpy; CPU via numpy NPY decode
- ZPayloadView.raw_bytes(): expose raw payload bytes (used by CPU decoder)
- ZPublisher.publish_tensor(): dispatches on device; CUDA uses from_torch +
  publish_zbuf with auto-keepalive; CPU serializes via numpy.save (NPY format)
- ZPublisher.publish_bytes(): raw bytes publish path
- CudaZBuf: keepalive: Option<PyObject> field; into_zbuf(keepalive=None)
- PyCudaBuf.from_torch / into_zbuf: pub for Rust callers (publish_tensor)
- ros_z_py.pyi: full type stubs for PyCudaBuf, CudaZBuf, ZPayloadView,
  ZPublisher, ZSubscriber, ZContext/ZNode
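
The device dispatch in publish_tensor() can be modeled with duck typing: torch CUDA tensors expose is_cuda, so anything on the GPU takes the zero-copy DLPack/IPC path and everything else falls back to NPY serialization. A schematic only (the dummy tensor and publish callbacks are placeholders, not the actual ros-z-py API):

```python
def publish_tensor(tensor, publish_cuda, publish_npy):
    """Route GPU tensors to the zero-copy path, CPU data to NPY bytes."""
    if getattr(tensor, "is_cuda", False):
        return publish_cuda(tensor)   # from_torch() + publish_zbuf w/ keepalive
    return publish_npy(tensor)        # numpy.save (NPY) wire format


class DummyCudaTensor:
    is_cuda = True                    # mimics torch.Tensor.is_cuda


taken = []
publish_tensor(DummyCudaTensor(),
               lambda t: taken.append("cuda"),
               lambda t: taken.append("npy"))
publish_tensor([1, 2, 3],            # plain CPU data, no is_cuda attribute
               lambda t: taken.append("cuda"),
               lambda t: taken.append("npy"))
assert taken == ["cuda", "npy"]
```

Dispatching on an attribute rather than an isinstance check keeps the publisher free of a hard torch import for CPU-only users.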

Regenerated after the zenoh patch branch gained the zenoh-mem-transport
crate and the ext_mem (0x8) capability-negotiation extension.

Add ort_publisher.py (Python/GPU) + ort_classifier.rs (Rust/ORT) as a
complete end-to-end example: torchvision preprocessing on GPU → CUDA IPC
→ SqueezeNet 1.1 inference via the ONNX Runtime CUDAExecutionProvider.

- examples/ort_publisher.py: publish [1,3,224,224] f32 tensor via publish_tensor()
- examples/ort_classifier.rs: recv_serialized_timeout → typed_cuda_slices
  → copy_to_host → ort Session → top-1 ImageNet label; ort/ndarray gated
  behind optional cuda feature to avoid network download in CI
- flake.nix: add gcc libstdc++ to cudaShellHook LD_LIBRARY_PATH so
  pip-installed torch finds libstdc++.so.6 without manual env hacks
- book: document the pipeline, setup steps, and asset download commands
- .gitignore: exclude *.onnx, imagenet_classes.txt, test_image.jpg
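
After copy_to_host, the classifier reduces the SqueezeNet logits to a top-1 ImageNet label. That last step is just a softmax plus argmax, sketched here with NumPy (the label list and logits are made up for illustration; the demo uses the full 1000-class imagenet_classes.txt):

```python
import numpy as np


def top1(logits: np.ndarray, labels: list) -> tuple:
    """Return the highest-probability label and its softmax confidence."""
    z = logits - logits.max()            # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    i = int(probs.argmax())
    return labels[i], float(probs[i])


labels = ["tabby cat", "golden retriever", "espresso"]
label, conf = top1(np.array([2.0, 5.0, 0.5]), labels)
assert label == "golden retriever" and conf > 0.9
```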

…mem-transport

Fixes CI: the local worktree paths don't exist on CI runners. Also fix
serialize_to_shm<B> impls to match the now-generic trait signature from
upstream main (ShmProviderBackend type parameter).

…t signature

The upstream main generalized serialize_to_shm to take B: ShmProviderBackend.
Fix the rmw-zenoh-rs impl (missed in the previous commit) and switch
Cargo.toml patches back to git refs for CI.

test_typed_tensor_zbuf: verifies TensorMeta (shape, dtype, strides)
survives ZBuf wrapping and is returned by typed_cuda_slices(); also
confirms CudaPtr without metadata does not appear there.

test_native_ipc_handle_non_zero: verifies alloc_device produces a
non-zero cudaIpcGetMemHandle that survives ZBuf wrapping — the handle
the subscriber will open. Full two-process IPC is covered by the ORT
cross-language demo.

FlakeHub Cache requires a paid subscription ($20/member/month) since
the free Magic Nix Cache tier was discontinued in Feb 2025.
All 7 flakehub-cache-action steps removed from ci.yml, docs.yml, and
mdbook-preview.yml. Cachix (ros.cachix.org) remains the Nix binary
cache for ROS jobs. id-token: write removed from jobs that no longer
need it.