feat: zero-copy large payload optimization for Python bindings #98
Open
YuanYuYuan wants to merge 7 commits into main
Implements 4-layer zero-copy optimization to eliminate excessive
memory copying when Python receives/sends large byte arrays.
Results:
- 8MB @ 200Hz: 52ms → 26ms p50 latency (50% reduction)
- 1MB @ 200Hz: ~4ms → 1.9ms p50 latency (52% reduction)
- Copies per round-trip: 6 → 1 (83% fewer)
Implementation:
1. ZBufView (Python Buffer Protocol)
- New: crates/ros-z/src/zbuf_view.rs
- Wraps ZBuf, exposes via __getbuffer__/__releasebuffer__
- Supports memoryview(), __getitem__, len(), __bool__
- zbuf() getter enables zero-copy re-publish
2. ZBufView Pass-Through (FromPyMessage)
- Modified: crates/ros-z-derive/src/lib.rs
- FromPyMessage detects ZBufView → zbuf.clone() (Arc, no data copy)
- Enables ByteMultiArray(data=received.data) without bytes() conversion
3. CDR Serialization Zero-Copy (Thread-Local)
- Modified: crates/ros-z-cdr/src/{lib,primitives}.rs, crates/ros-z/src/zbuf.rs
- ZBuf::Serialize stores ZBuf in ZBUF_SERIALIZE_BYPASS thread-local
- CdrWriter::write_bytes uses append_zbuf() (ZSlice clone) vs extend_from_slice()
4. CDR Deserialization Zero-Copy (Thread-Local)
- Modified: crates/ros-z-cdr/src/lib.rs, crates/ros-z/src/zbuf.rs,
crates/ros-z-py/src/pubsub.rs
- recv() stores payload ZBuf in ZBUF_DESER_SOURCE thread-local
- ZBuf::Deserialize creates sub-ZSlice via ZSlice::subslice() vs to_vec()
Additional:
- Updated .gitignore: exclude ros-z-msgs/python/{build/,uv.lock}
- Added ros-z-cdr dependency to ros-z-py/Cargo.toml
Verification: All 126+ tests pass
Extends Layer 4 (CDR deserialization bypass) to benefit Rust-to-Rust
communication, not just Python bindings.
Modified all recv methods in crates/ros-z/src/pubsub.rs:
- recv(), recv_timeout(), async_recv() (generic ZSub)
- recv(), recv_timeout(), async_recv(), try_recv() (DynamicMessage)
Each now:
1. Converts ZBytes → ZBuf (zero-cost via From trait)
2. Sets ZBUF_DESER_SOURCE thread-local
3. Calls deserialize with contiguous view
4. Clears thread-local
Result: ZBuf fields in Rust messages now use ZSlice::subslice() (Arc clone)
instead of .to_vec() (memcpy), matching Python's zero-copy deserialization path.
Verification: All 122 tests pass
The ZBUF_SERIALIZE_BYPASS in ZBuf::serialize() and ZBUF_DESER_SOURCE in Rust recv() methods added +42% overhead for zero benefit — append_zbuf on Vec<u8> just does contiguous() + extend_from_slice(), identical to the non-bypass path. Only Python/SHM paths benefit from the bypass. Revert Rust recv() to original to_bytes() path and remove bypass from ZBuf::serialize(). Python paths still set the thread-locals externally.
The bypass was only ever set by ZBuf::serialize() (removed in previous commit) and never by Python or SHM paths. On Vec<u8> buffers, append_zbuf does contiguous() + extend_from_slice() — identical to the non-bypass path — making the thread-local check pure overhead. The CdrBuffer::append_zbuf trait method is kept for direct use by CdrSerializer<ZBufWriter>::serialize_zbuf() (SHM path).
Add comprehensive documentation for ZBufView and zero-copy byte array
handling in the mdBook:
python.md:
- New "Performance: Zero-Copy Large Payloads" section
- ZBufView usage examples (memoryview, slicing, echo patterns)
- ZBufView API reference table
- Three-layer optimization explanation (deser bypass, buffer protocol,
  pass-through re-publish)
python_codegen.md:
- Updated type mapping: uint8[]/byte[] now shows ZBufView
- Enhanced subscribing data flow diagram with zero-copy annotations
- New "Derive Macros: Zero-Copy Byte Array Handling" section
- FromPyMessage priority chain (ZBufView → bytes → bytearray → list)
- IntoPyMessage ZBufView creation code examples
- Expanded zero-copy feature description
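The FromPyMessage priority chain mentioned above (ZBufView → bytes → bytearray → list) can be pictured with a small Python sketch. The real logic lives in the Rust derive macro; `FakeZBufView` and `extract_payload` are hypothetical names, and a shared `bytearray` stands in for the Arc-backed ZBuf.

```python
class FakeZBufView:
    """Hypothetical stand-in for ZBufView: wraps a shared buffer."""

    def __init__(self, buf):
        self._buf = buf

    def zbuf(self):
        # Hands back the same underlying buffer (analogous to an Arc clone).
        return self._buf


def extract_payload(value):
    """Return (buffer, path_taken), trying the cheapest conversion first."""
    if isinstance(value, FakeZBufView):
        return value.zbuf(), "zbufview"      # no data copy
    if isinstance(value, bytes):
        return bytearray(value), "bytes"     # one copy
    if isinstance(value, bytearray):
        return bytearray(value), "bytearray" # one copy
    if isinstance(value, list):
        return bytearray(value), "list"      # per-element conversion
    raise TypeError(f"unsupported payload type: {type(value).__name__}")


shared = bytearray(b"sensor-frame")
buf, path = extract_payload(FakeZBufView(shared))
```

Only the first branch returns the original buffer; every fallback branch allocates a new one, which is why passing the received field straight through avoids a copy.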
Fix clippy::let_and_return warning in ZSub::recv_async_inner. Return the deserialization result directly instead of binding to an unused variable.
PyO3 0.22 generates code that calls unsafe functions (BoundRef::ref_from_ptr) in its proc macros. In Rust edition 2024, this requires unsafe blocks even inside unsafe fn. Add module-level allow since PyO3's generated code handles the safety invariants correctly.
Summary
Eliminates excessive memory copying when ros-z-py handles large byte arrays (sensor data, images, point clouds) at high frequencies.
Performance: 1MB @ 200Hz latency improved from 2932µs to 1969µs (33% reduction)
Implementation
Three-layer zero-copy optimization for uint8[]/byte[] fields:
Layer 1: ZBufView (Python Buffer Protocol)
- Wraps ZBuf and exposes bytes via __getbuffer__/__releasebuffer__
- Supports memoryview(), len(), __getitem__ (subscript/slice), __bool__
Layer 2: ZBufView Pass-Through (FromPyMessage)
- FromPyMessage derive macro detects ZBufView → extracts inner ZBuf via clone() (Arc ref-count, no data copy)
- Enables ByteMultiArray(data=received.data) without bytes() conversion
- Falls back to bytes/bytearray/list[int] if not ZBufView
Layer 3: CDR Deserialization Bypass (Thread-Local)
- recv() stores payload ZBuf in ZBUF_DESER_SOURCE thread-local
- ZBuf::Deserialize::visit_borrowed_bytes creates sub-ZSlice via ZSlice::subslice() (zero-copy)
- Falls back to .to_vec() for non-contiguous payloads
Copy Count Per Round-Trip (1MB payload)
Key Changes
Core:
- crates/ros-z/src/zbuf_view.rs (new): Python buffer protocol for ZBuf
- crates/ros-z/src/zbuf.rs: Deserialize uses sub-ZSlice via ZBUF_DESER_SOURCE
- crates/ros-z-cdr/src/lib.rs: Thread-local ZBUF_DESER_SOURCE for zero-copy deser bypass
Python bindings:
- crates/ros-z-py/src/pubsub.rs: Python recv()/try_recv() set ZBUF_DESER_SOURCE
- crates/ros-z-py/src/lib.rs: Registered ZBufView Python class
- crates/ros-z-derive/src/lib.rs: FromPyMessage maps ZBufView → Arc clone; IntoPyMessage creates ZBufView
Documentation:
- book/src/chapters/python.md: New "Performance: Zero-Copy Large Payloads" section
- book/src/chapters/python_codegen.md: Updated type mapping, data flow diagrams, derive macro details
Testing
cargo nextest run # 328 tests passed
API Example
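A minimal sketch of how a buffer-protocol field is typically consumed from Python. It does not use the ros-z-py API: a `bytearray` stands in for the `data` field of a received message (assumption: the real field is a ZBufView, which the PR says supports memoryview(), len(), and indexing in the same way).

```python
# Stand-in for `msg.data` on a received message (hypothetical; the real
# object would be a ZBufView backed by the Zenoh payload).
data = bytearray(range(256)) * 4096   # 1 MiB payload

view = memoryview(data)        # exposes the buffer without copying
header = view[:16]             # slicing a memoryview does not copy either
first_byte = data[0]           # single-element indexing
snapshot = bytes(view[:16])    # explicit copy, limited to the bytes needed

# The view tracks the underlying buffer; the snapshot does not.
data[0] = 0xAA
shares_memory = header[0] == 0xAA
snapshot_unchanged = snapshot[0] == 0x00
```

Taking a `bytes()` copy of only the region that must outlive the message keeps the large payload itself on the zero-copy path.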
Notes
- Rust recv() path does NOT use the thread-local bypass (simple to_bytes() path); only Python benefits
- ZBUF_SERIALIZE_BYPASS was explored but removed: it fragmented ZBuf into multiple ZSlices, forcing Zenoh to reassemble before network send (+36% regression)
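Why fragmentation can cost more than it saves may be easier to see in a small Python sketch. This is an illustration under simplifying assumptions, not Zenoh's implementation: a list of `memoryview` fragments stands in for a ZBuf made of several ZSlices, and `contiguous` is a hypothetical helper.

```python
def contiguous(fragments):
    """Return (buffer, copied) for a buffer made of one or more fragments."""
    if len(fragments) == 1:
        # A single fragment is already contiguous: hand it back as-is.
        return fragments[0], False
    # Several fragments must be copied into one new allocation.
    return memoryview(b"".join(fragments)), True


body = bytes(1024)

# One fragment: nothing to reassemble.
single, single_copied = contiguous([memoryview(body)])

# Header and body kept as separate fragments: reassembly copies every byte.
merged, merged_copied = contiguous([memoryview(b"HDR!"), memoryview(body)])
```

In this model, appending the payload as its own fragment avoids a copy at serialization time but moves that copy to the point where a contiguous buffer is required.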