feat: zero-copy large payload optimization for Python bindings #98

Open
YuanYuYuan wants to merge 7 commits into main from dev/python-large-payload

@YuanYuYuan
Collaborator

Summary

Eliminates excessive memory copying when ros-z-py handles large byte arrays (sensor data, images, point clouds) at high frequencies.

Performance: 1MB@200Hz latency improved from 2932µs to 1969µs (33% reduction)

Implementation

Three-layer zero-copy optimization for uint8[]/byte[] fields:

Layer 1: ZBufView (Python Buffer Protocol)

  • Wraps ZBuf and exposes bytes via __getbuffer__/__releasebuffer__
  • Supports memoryview(), len(), __getitem__ (subscript/slice), __bool__
  • Zero-copy for contiguous buffers (common case), fallback copy for fragmented
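
The contiguous versus fragmented distinction in Layer 1 can be illustrated with plain Python objects. This is a minimal sketch, not the ros-z-py implementation: `FakeZBuf` and `as_view` are hypothetical names, and a list of `bytearray` chunks stands in for the ZSlices inside a ZBuf.

```python
class FakeZBuf:
    """Hypothetical stand-in for a ZBuf: an ordered list of byte chunks."""

    def __init__(self, chunks):
        # Setup only: each chunk gets its own mutable storage.
        self.chunks = [bytearray(c) for c in chunks]

    def as_view(self):
        """Return (memoryview, copied) over the logical payload."""
        if len(self.chunks) == 1:
            # Contiguous case: expose the existing storage, no copy.
            return memoryview(self.chunks[0]), False
        # Fragmented case: the buffer protocol needs one flat region,
        # so the chunks are joined into a new buffer (one copy).
        return memoryview(b"".join(self.chunks)), True


contiguous = FakeZBuf([b"\x01\x02\x03\x04"])
view, copied = contiguous.as_view()
contiguous.chunks[0][0] = 0xFF   # mutate the backing storage
first_byte = view[0]             # the view observes the change

fragmented = FakeZBuf([b"\x01\x02", b"\x03\x04"])
frag_view, frag_copied = fragmented.as_view()
```

The view over the single-chunk buffer sees the later write, which is what sharing storage means in practice; the two-chunk buffer yields a joined copy instead.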

Layer 2: ZBufView Pass-Through (FromPyMessage)

  • FromPyMessage derive macro detects ZBufView → extracts inner ZBuf via clone() (Arc ref-count, no data copy)
  • Enables echo patterns: ByteMultiArray(data=received.data) without bytes() conversion
  • Falls back to bytes/bytearray/list[int] if not ZBufView
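
The priority order described for Layer 2 can be sketched as a simple type dispatch. The names `FakeZBufView` and `extract_payload` are hypothetical, and sharing a Python object reference is used here only as an analogy for the Arc ref-count clone; the real logic lives in the Rust derive macro.

```python
class FakeZBufView:
    """Hypothetical stand-in for ZBufView: holds a reference to shared storage."""

    def __init__(self, storage):
        self.storage = storage  # shared, never copied here


def extract_payload(value):
    """Return (storage, copied), trying ZBufView, then bytes/bytearray, then list."""
    if isinstance(value, FakeZBufView):
        # Pass-through: reuse the same storage object (no data copy).
        return value.storage, False
    if isinstance(value, (bytes, bytearray)):
        # Python-owned bytes are copied into a new buffer.
        return bytearray(value), True
    if isinstance(value, list):
        # list[int] is converted element by element into a new buffer.
        return bytearray(value), True
    raise TypeError(f"unsupported payload type: {type(value).__name__}")


storage = bytearray(1024)
passed, passed_copied = extract_payload(FakeZBufView(storage))
from_bytes, bytes_copied = extract_payload(b"abc")
from_list, list_copied = extract_payload([1, 2, 3])
```

Only the first branch avoids touching the payload bytes, which is why the echo pattern benefits from passing `received.data` straight through.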

Layer 3: CDR Deserialization Bypass (Thread-Local)

  • Python recv() stores payload ZBuf in ZBUF_DESER_SOURCE thread-local
  • ZBuf::Deserialize::visit_borrowed_bytes creates sub-ZSlice via ZSlice::subslice() (zero-copy)
  • Falls back to .to_vec() for non-contiguous payloads

Copy Count Per Round-Trip (1MB payload)

| Step | ros-z-py (baseline) | ros-z-py (lp) |
|---|---|---|
| Ping FromPyMessage | 2 copies | 1 copy |
| Ping serialize | 1 copy | 1 copy |
| Pong recv | 1 copy | 0 (zero-copy) |
| Pong FromPyMessage | 2 copies | 0 (Arc clone) |
| Pong serialize | 1 copy | 1 copy |
| Ping recv | 1 copy | 0 (zero-copy) |
| Total | 8 | 3 |
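
As a quick check of the totals, the per-step counts from the table above can be summed and turned into an approximate memcpy volume. The assumption that the 200 Hz benchmark performs one full round-trip per tick is mine, not stated in the PR.

```python
# Copies per step, transcribed from the table above: (baseline, lp)
COPIES = {
    "Ping FromPyMessage": (2, 1),
    "Ping serialize": (1, 1),
    "Pong recv": (1, 0),
    "Pong FromPyMessage": (2, 0),
    "Pong serialize": (1, 1),
    "Ping recv": (1, 0),
}

PAYLOAD_MB = 1   # payload size from the table heading
RATE_HZ = 200    # assumption: one round-trip per tick of the 200 Hz benchmark

baseline_total = sum(base for base, _ in COPIES.values())
lp_total = sum(lp for _, lp in COPIES.values())

# Megabytes copied per second under the assumption above
baseline_mb_per_s = baseline_total * PAYLOAD_MB * RATE_HZ
lp_mb_per_s = lp_total * PAYLOAD_MB * RATE_HZ
```

Under that assumption the copy traffic drops from 1600 MB/s to 600 MB/s, which is consistent in direction with the reported latency improvement.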

Key Changes

Core:

  • crates/ros-z/src/zbuf_view.rs — New: Python buffer protocol for ZBuf
  • crates/ros-z/src/zbuf.rs — Deserialize uses sub-ZSlice via ZBUF_DESER_SOURCE
  • crates/ros-z-cdr/src/lib.rs — Thread-local ZBUF_DESER_SOURCE for zero-copy deser bypass

Python bindings:

  • crates/ros-z-py/src/pubsub.rs — Python recv()/try_recv() set ZBUF_DESER_SOURCE
  • crates/ros-z-py/src/lib.rs — Registered ZBufView Python class
  • crates/ros-z-derive/src/lib.rs — FromPyMessage: ZBufView → Arc clone; IntoPyMessage: creates ZBufView

Documentation:

  • book/src/chapters/python.md — New "Performance: Zero-Copy Large Payloads" section
  • book/src/chapters/python_codegen.md — Updated type mapping, data flow diagrams, derive macro details

Testing

cargo nextest run  # 328 tests passed

API Example

# Receive and re-publish without copying large byte arrays
msg = sub.recv(timeout=1.0)
# msg.data is a ZBufView — no copy has occurred

# Zero-copy access via buffer protocol
mv = memoryview(msg.data)
header = mv[:8]  # Slice without copying entire payload

# Echo pattern — zero-copy for byte array fields
echo = std_msgs.ByteMultiArray(data=msg.data)  # No copy!
pub.publish(echo)
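
The snippet above needs a running publisher and subscriber. To try the `memoryview` part in isolation, the following sketch substitutes a `bytearray` for `msg.data`; the 8-byte header layout (two little-endian uint32 values) is invented for the example and is not part of any ros-z message.

```python
import struct

# Stand-in for msg.data: an 8-byte header followed by a 1 MiB body.
# The header layout is made up for this sketch.
BODY_LEN = 1024 * 1024
payload = bytearray(struct.pack("<II", 7, BODY_LEN) + bytes(BODY_LEN))

mv = memoryview(payload)
header = mv[:8]                      # a view; the body is not copied
seq, body_len = struct.unpack_from("<II", header)
body = mv[8:8 + body_len]            # another view into the same storage

# Writing through the view changes the original buffer,
# which shows that no private copy was made.
body[0] = 0xAB
```

Any object that implements the buffer protocol can be handled the same way, so code written against `memoryview` should not need to know whether it was given `bytes`, a `bytearray`, or a ZBufView.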

Notes

  • The Rust recv() path does NOT use the thread-local bypass (simple to_bytes() path) — only Python benefits
  • ZBUF_SERIALIZE_BYPASS was explored but removed: it fragmented ZBuf into multiple ZSlices, forcing Zenoh to reassemble before network send (+36% regression)

Implements 4-layer zero-copy optimization to eliminate excessive
memory copying when Python receives/sends large byte arrays.

Results:
- 8MB @ 200Hz: 52ms → 26ms p50 latency (50% reduction)
- 1MB @ 200Hz: ~4ms → 1.9ms p50 latency (52% reduction)
- Copies per round-trip: 6 → 1 (83% fewer)
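
The percentages above follow from the usual relative-decrease formula. A small check, using only the figures quoted in this commit message (`reduction_pct` is a helper introduced for the example):

```python
def reduction_pct(before, after):
    """Percentage decrease from `before` to `after`."""
    return (before - after) / before * 100.0


latency_8mb = reduction_pct(52.0, 26.0)  # ms, 8MB @ 200Hz
latency_1mb = reduction_pct(4.0, 1.9)    # ms; the "before" value is approximate in the source
copies = reduction_pct(6, 1)             # copies per round-trip
```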

Implementation:

1. ZBufView (Python Buffer Protocol)
   - New: crates/ros-z/src/zbuf_view.rs
   - Wraps ZBuf, exposes via __getbuffer__/__releasebuffer__
   - Supports memoryview(), __getitem__, len(), __bool__
   - zbuf() getter enables zero-copy re-publish

2. ZBufView Pass-Through (FromPyMessage)
   - Modified: crates/ros-z-derive/src/lib.rs
   - FromPyMessage detects ZBufView → zbuf.clone() (Arc, no data copy)
   - Enables ByteMultiArray(data=received.data) without bytes() conversion

3. CDR Serialization Zero-Copy (Thread-Local)
   - Modified: crates/ros-z-cdr/src/{lib,primitives}.rs, crates/ros-z/src/zbuf.rs
   - ZBuf::Serialize stores ZBuf in ZBUF_SERIALIZE_BYPASS thread-local
   - CdrWriter::write_bytes uses append_zbuf() (ZSlice clone) vs extend_from_slice()

4. CDR Deserialization Zero-Copy (Thread-Local)
   - Modified: crates/ros-z-cdr/src/lib.rs, crates/ros-z/src/zbuf.rs,
     crates/ros-z-py/src/pubsub.rs
   - recv() stores payload ZBuf in ZBUF_DESER_SOURCE thread-local
   - ZBuf::Deserialize creates sub-ZSlice via ZSlice::subslice() vs to_vec()

Additional:
- Updated .gitignore: exclude ros-z-msgs/python/{build/,uv.lock}
- Added ros-z-cdr dependency to ros-z-py/Cargo.toml

Verification: All 126+ tests pass

Extends Layer 4 (CDR deserialization bypass) to benefit Rust-to-Rust
communication, not just Python bindings.

Modified all recv methods in crates/ros-z/src/pubsub.rs:
- recv(), recv_timeout(), async_recv() (generic ZSub)
- recv(), recv_timeout(), async_recv(), try_recv() (DynamicMessage)

Each now:
1. Converts ZBytes → ZBuf (zero-cost via From trait)
2. Sets ZBUF_DESER_SOURCE thread-local
3. Calls deserialize with contiguous view
4. Clears thread-local

Result: ZBuf fields in Rust messages now use ZSlice::subslice()
(Arc clone) instead of .to_vec() (memcpy), matching Python's
zero-copy deserialization path.

Verification: All 122 tests pass

The ZBUF_SERIALIZE_BYPASS in ZBuf::serialize() and ZBUF_DESER_SOURCE in
Rust recv() methods added +42% overhead for zero benefit — append_zbuf
on Vec<u8> just does contiguous() + extend_from_slice(), identical to
the non-bypass path. Only Python/SHM paths benefit from the bypass.

Revert Rust recv() to original to_bytes() path and remove bypass from
ZBuf::serialize(). Python paths still set the thread-locals externally.

The bypass was only ever set by ZBuf::serialize() (removed in previous
commit) and never by Python or SHM paths. On Vec<u8> buffers,
append_zbuf does contiguous() + extend_from_slice() — identical to the
non-bypass path — making the thread-local check pure overhead.

The CdrBuffer::append_zbuf trait method is kept for direct use by
CdrSerializer<ZBufWriter>::serialize_zbuf() (SHM path).

Add comprehensive documentation for ZBufView and zero-copy byte array
handling in the mdBook:

python.md:
- New "Performance: Zero-Copy Large Payloads" section
- ZBufView usage examples (memoryview, slicing, echo patterns)
- ZBufView API reference table
- Three-layer optimization explanation (deser bypass, buffer protocol,
  pass-through re-publish)

python_codegen.md:
- Updated type mapping: uint8[]/byte[] now shows ZBufView
- Enhanced subscribing data flow diagram with zero-copy annotations
- New "Derive Macros: Zero-Copy Byte Array Handling" section
- FromPyMessage priority chain (ZBufView → bytes → bytearray → list)
- IntoPyMessage ZBufView creation code examples
- Expanded zero-copy feature description

Fix clippy::let_and_return warning in ZSub::recv_async_inner.
Return the deserialization result directly instead of binding it to
an intermediate variable first.

@github-actions

github-actions bot commented Feb 14, 2026

PR Preview Action v1.8.1

🚀 View preview at
https://ZettaScaleLabs.github.io/ros-z/pr-preview/pr-98/

Built to branch gh-pages at 2026-02-14 10:45 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@YuanYuYuan changed the title from "Zero-copy large payload optimization for Python bindings" to "feat: zero-copy large payload optimization for Python bindings" on Feb 14, 2026

PyO3 0.22 generates code that calls unsafe functions (BoundRef::ref_from_ptr)
in its proc macros. In Rust edition 2024, this requires unsafe blocks even
inside unsafe fn. Add module-level allow since PyO3's generated code handles
the safety invariants correctly.